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INTRODUCTION 


The  PISCES  program  developed  by  Stanford  University  has  become  a  world-wide  resource  for 
both  academic  research  and  industrial  applications  in  device  modeling.  The  PISCES  code  sup¬ 
ports  two-dimensional  (2D)  analysis  of  electronic  devices  for  dc,  ac  and  time- transient  operating 
conditions.  Over  more  than  a  decade  there  have  been  three  major  new  releases  of  the  code  (ver¬ 
sions  2, 2B  and  2ET)  as  well  as  commercial  vendors  that  now  support  proprietary  versions  based 
on  the  original  Stanford  code.  Over  the  past  half-dozen  years  the  Semiconductor  Research  Corpo¬ 
ration  (SRC)  has  become  a  partner  with  ARO  in  the  research  support  for  PISCES  developments, 
their  major  focus  being  on  the  developments  that  support  silicon-based  technology.  There  have 
also  been  AREA  supported  contracts  to  create  parallel  computing  versions  of  both  2D  and  3D 
device  analysis  based  on  PISCES  and  a  3D  prototype  code  STRIDE,  also  initially  developed 
under  ARO  support.  The  focus  of  the  ARO  supported  research  activities  has  been  in  the  areas  of 
multi-material  modeling  for  advanced  physical  models  and  in  the  underlying  techniques  and  algo¬ 
rithms  that  support  efficient  computation. 


OBJECTIVES 

The  goal  of  this  research  is  targeted  at  developing  new  methods  and  prototype  design  tools  to  sup¬ 
port  modeling  of  advanced  multi-material  devices  that  are  required  for  high-speed  circuits.  The 
simulation  of  such  devices  has  become  an  essential  stage  in  the  design  and  manufacturing  of  new 
components.  With  the  rapid  evolution  of  synthesized  heterojunction  materials,  it  is  of  major  bene¬ 
fit  to  have  models  to  provide  physical  undeipinning  to  the  very  expensive  and  empirical  technol¬ 
ogy  development  cycle. 


In  order  to  support  the  overall  modeling  goal,  the  research  objectives  center  on  developing  new, 
flexible  simulation  capabilities  and  on  providing  advanced  physical  models  in  order  to  support  the 
revolutionary  developments  in  device  technology.  From  a  computational  perspective,  the  devel¬ 
opment  of  framework  utilities  to  reduce  the  user  burden  and  the  exploitation  of  parallel  computers 
to  make  the  simulations  more  efficient  are  key  areas  of  research  activity  in  this  project.  A  major 
focus  in  this  research  has  been  to  develop  physical  models  and  numerical  techniques  that  support 
analysis  of  compound  and  heterojunction  material  devices. 

RESEARCH  RESULTS 

The  scope  of  the  efforts  supported  by  this  ARO  is  centered  on  models,  algorithms  and  prototype 
2D  design  tools  that  support  multi-layer  device  simulation.  Nonetheless,  over  the  course  of  previ¬ 
ous  ARO  contracts  that  supported  PISCES  and  related  algorithmic  developments,  key  concepts 
have  been  developed  that  have  now  matured  and  are  bearing  fruit  in  the  context  of  other  projects. 
As  a  result,  this  report  includes  both  the  immediate  outcome  of  the  presently  funded  work  as  well 
as  the  spin-off  results  that  had  their  foundations  in  earlier  ARO  supported  contributions. 

The  report  is  organized  into  three  main  sub-sections  and  a  set  of  supporting  appendices.  First,  the 
key  contributions  to  physical  modeling  and  the  supporting  formulations  are  presented.  Second, 
there  are  a  number  of  technology  applications  where  the  modeling  activities  have  been  bench- 
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marked  and  the  industrial  benefits  can  be  measured.  Third,  the  underlying  framework  and  compu¬ 
tational  aspects  of  the  modeling  work  are  presented.  Here  too  there  are  a  number  of  applications 
resulting  from  the  growing  industrial  use  of  the  tools  and  the  national  project  in  High  Perfor¬ 
mance  Computing  (HPC).  Since  a  majority  of  the  research  supported  by  this  contract  has  been 
published  at  conferences  and  in  other  archival  journals,  the  use  of  appendices  that  draw  from  this 
body  of  work  will  be  the  main  approach  for  summarizing  technical  details.  Finally,  it  should  be 
noted  that  other  supporting  materials  including  both  software  and  Ph.D.  thesis  documentation  is 
available  directly  from  Stanford  by  means  of  internet  access  using  the  following  world-wide  web 
URL  address  at  http://www-ee.stanford.edu/tcad.html. 

1.  ADVANCED  DEVICE  MODELS— There  have  been  three  major  developments  in  transport 
modeling  of  multi-layer  materials:  a)  formulation  and  implementation  of  eneigy  transport  (ET) 
models  that  include  fully  coupled  carrier  and  lattice  temperatures,  b)  implementation  of  hetero¬ 
junction  interfaces  in  support  of  a  variety  of  material  systems  and  c)  complete  implementation 
and  testing  of  deep  level  trapping  for  GaAs  devices.  The  highlights  of  these  activities  are  now 
reviewed  and  further  details  are  found  in  Appendix  1. 

la.  Energy  Transport  Models:  Over  this  contract  period  we  have  developed  both  the  basic  con¬ 
cepts  of  the  energy  transport  (ET)  model  [1]  and  a  fully  coupled  version,  called  Dual  ET  (i.e. 
DUET  or  2ET)[2],  that  models  both  carrier  and  lattice  temperatures  in  a  coupled  and  consistent 
manner.  In  coordination  with  SRC  that  supports  the  silicon  applications,  this  research  has  not  only 
developed  the  basic  formulation  but  also  mounted  an  intense  effort  to  calibrate  and  check  the 
accuracy  of  the  physical  approach.  Of  special  interest  is  the  calibration  of  mobility  models  [3]  as 
well  as  impact  ionization  [4],  Quantitative  agreement  in  the  modeling  with  experimental  substrate 
current  results  for  submicron  silicon  devices  is  most  encouraging.  Moreover,  deep  submicron 
behavior  of  0.12  micron  SOI  devices  measured  by  Professor  Hu’s  group  at  U.C.  Berkeley  shows  a 
good  match  with  the  simulated  results  obtained  using  the  DUET  model  [4], 

lb.  Heterojunction  Interfaces:  In  order  to  support  the  modeling  of  high-speed  and  optoelectronic 
devices  that  utilize  heterojunctions,  major  enhancements  to  the  model  formulation  and  implemen¬ 
tation  for  device  simulation  were  required.  There  are  a  variety  of  changes  related  to  materials  and 
interface  properties,  including  consideration  of  energy  band  offsets.  Owing  to  spatial  variation  of 
composition  there  is  also  a  need  to  interpolate  data  across  the  various  regions.  Details  of  the  mod¬ 
els  supported  are  given  in  the  PISCES -2ET  manual  [5]. 

lc.  Deep  Level  Trapping:  The  role  of  deep  level  traps  on  changing  both  the  dc  and  ac  behavior  of 
GaAs  devices  has  been  recognized  for  many  years  but  a  quantitative  model  of  the  effect  has  been 
lacking.  Under  ARPA  support  of  an  experimental  effort,  including  extensive  process  modeling 
and  fabrication  of  test  structures,  the  ARO  efforts  in  device  analysis  have  provided  an  essential 
compliment.  Based  on  implementation  of  a  general  rate  equation  for  deep  level  impurities,  includ¬ 
ing  the  ac  and  time  transient  electronic  trapping  behavior,  it  is  now  possible  to  achieve  quantita¬ 
tive  agreement  of  side  gating  effects  in  bulk  GaAs  [6],  Two  major  mechanisms  are  revealed  to  be 
responsible  for  sidegating:  1)  the  formation  of  a  Gunn  domain  at  the  channel/substrate  interface 
(requiring  a  negative  differential  mobility  model)  and  2)  substrate  conduction  between  a  parasitic 
Schottky  contact  and  the  sidegate  due  to  carrier  injection.  Both  effects  are  modeled  quantitatively 
for  the  first  time  in  this  work. 
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2.  TECHNOLOGY  DRIVERS— By  means  of  a  number  of  specific  applications,  the  advanced 
modeling  capabilities  developed  in  this  research  have  been  tested.  The  support  for  the  PISCES 
development  efforts  have  come  from  several  sources.  For  the  sake  of  completeness  in  this  report, 
that  diversity  of  applications  will  also  be  reflected.  A  representative  set  of  technologies,  both  sili¬ 
con-based  as  well  as  those  using  compound  materials,  will  be  used  for  illustration.  In  the  area  of 
silicon  technologies,  co-sponsored  by  SRC,  the  application  to  power  devices  is  given  as  the  pri¬ 
mary  example.  For  compound  materials,  two  examples  are  reviewed,  the  first  one  involves  GaAs 
MESFETs  with  co-sponsorship  coming  from  ARPA.  The  second  example  is  solely  sponsored  by 
ARO  and  considers  AlGaAs  used  for  opto-electronic  (OEIC)  devices.  As  done  in  the  previous 
section.  Appendix  2  is  used  to  provide  further  details  on  each  of  these  application  domains. 

2a.  Silicon-based  Applications:  The  previous  section  has  considered  advances  in  physical  models 
developed  in  this  research.  The  research  and  development  of  the  DUET  model  [1]  for  considering 
the  coupled  temperature  solution  is  a  major  outcome  of  this  project.  A  large  number  of  the  appli¬ 
cations  related  to  silicon  technology  come  from  the  commercial  semiconductor  (SRC)  areas.  The 
calibration  of  substrate  current  models  and  deep  submicron  SOI  characteristics  have  already  been 
cited.  The  case  of  SOI  modeling  has  broad  implications  for  both  high  speed  communications  and 
for  radiation  resistant  applications.  Another  application  currently  under  investigation  is  electro¬ 
static  discharge  (ESD).  Preliminary  results  show  that  the  coupling  of  transient  current  pulses  and 
joule  heating  is  essential.  It  is  expected  that  follow-on  work  in  this  area  under  SRC  support  will 
more  completely  demonstrate  the  full  impact  of  the  DUET  model  representing  physical  phenom¬ 
ena  involved  in  ESD. 

The  primary  silicon  application  reviewed  here  is  that  of  modeling  power  devices.  While  Stanford 
has  had  only  small,  primarily  industrially  funded,  contracts  in  this  area,  there  are  a  sustained  set  of 
contributions  over  more  than  a  decade.  Of  special  interest  here  is  the  need  to  support  a  variety  of 
physical  models  and  complex  geometries  for  power  devices.  Specifically,  the  need  for  mobility 
models  that  capture  high  field  and  carrier-carrier  scattering  mechanics  as  well  as  various  other 
surface  phenomena  that  occur  in  power  MOS  devices  are  critical  [7],  The  issues  of  complex 
geometry  are  also  important  factors  for  power  devices.  The  use  of  trench  isolation  and  cylindrical 
geometries  are  both  aspects  of  PISCES  enhancements  that  have  received  significant  attention 
owing  to  unique  factors  associated  with  these  devices.  There  are  many  trends  in  power  devices, 
not  covered  specifically  in  this  report  which  are  expected  to  leverage  from  the  newest  version  of 
PISCES.  For  example,  the  growing  interest  in  SiC  as  well  as  other  compound  materials  is  one  key 
trend.  The  analysis  capabilities,  coupled  carrier  and  lattice  temperatures,  have  already  been  men¬ 
tioned  in  the  context  of  SOI  devices. 

2b.  GaAs  Device  Simulation:  The  modeling  of  GaAs  technology  continues  to  be  a  significant 
application  area  owing  to  the  importance  of  circuits  and  devices  for  high  speed  communications. 
Under  ARPA  support,  Stanford  has  developed  versions  of  SUPREM  to  model  GaAs  fabrication 
processes  in  both  ID  (version  3.5)  and  2D  (version  4GS).  The  device  modeling  of  GaAs  has  been 
supported  through  both  the  ARPA  contract  and  the  ARO  efforts  reported  here  [5],  There  are  sev¬ 
eral  aspects  of  the  GaAs  device  modeling  that  are  of  special  interest  and  importance:  1)  the  mobil¬ 
ity  model  [3],  2)  the  trapping  effects  [6],  and  3)  the  ac  high  frequency  behavior [8] . 
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Modeling  of  the  negative  differential  mobility  (NDM)  and  trapping  have  been  critical  in  provid¬ 
ing  quantitative  prediction  of  the  sidegating  effects  observed  in  GaAs  [6].  The  trapping  of  carri¬ 
ers,  combined  with  the  surface  and  bulk  boundary  conditions  result  in  the  creation  of  a  stationary 
Gunn  domain.  In  contrast  to  earlier  models  that  suggested  other  coupling  between  physical  mod¬ 
els  such  as  trapping  and  impact  ionization,  the  present  simulations  have  achieved  quantitative 
agreement  with  experimental  studies.  Further  details  are  summarized  in  Appendix  1  and  in  the 
thesis  contribution  of  Liu  [9]. 

Another  key  aspect  of  GaAs  MESFET  modeling  is  the  correct  prediction  of  high  frequency 
behavior.  In  this  work,  two  key  factors  have  contributed  to  the  successful  matching  of  experimen¬ 
tal  results  up  to  the  frequency  regime  of  20  GHz  [8].  First,  the  development  of  robust  numerical 
techniques  to  be  considered  in  the  following  section,  now  allow  ac  analysis  to  frequencies  beyond 
the  cutoff  of  the  device.  Since  measurement  accuracy  and  resolution  become  questionable  in  this 
regime,  having  an  accurate  modeling  capability  is  especially  important.  Second,  the  parasitics 
show  critical  effects  at  the  highest  frequencies.  For  example,  the  package  parasitic  inductance  is 
observed  to  play  a  major  role  in  these  devices  above  5  GHz.  The  development  of  both  geometry 
modeling  and  mixed-mode  simulation  support  for  the  distributed  device  effects  such  as  GaAs 
FET  high  frequency  modeling  have  provided  major  benefits  as  demonstrated  in  this  project. 

2c.  Optoelectronic  (OEIC)  Devices:  The  simulation  and  modeling  of  OEIC  device  effects  is  a 
final  application  area  stressed  in  the  research  supported  through  this  ARO  contract.  In  collabora¬ 
tion  with  Hewlett-Packard  and  Los  Alamos  National  Labs  (LANL)  in  their  joint  CRADA  for 
OEIC  modeling,  the  PISCES  code  has  been  extended  and  enhanced  for  laser  modeling  as  well  as 
for  simpler  LED  structures  being  developed  by  the  optical  components  division  (OCD).  Several 
aspects  of  the  cylindrical  geometry  and  mixed-mode  analysis  capabilities  discussed  above  are 
also  critical  for  OEIC  modeling  [10],  In  the  case  of  data  communications  transmitter  circuits,  the 
complexity  of  electrical  boundary  conditions  on  final  optical  output  is  especially  important. 
Appendix  2  shows  in  some  detail  the  interplay  of  geometry  modeling  for  device  analysis  specifi¬ 
cation  and  the  unique  capabilities  provided  through  mixed-mode  modeling  of  OEICs.  The  key 
point  to  be  reiterated  here  is  the  fact  that  extracted  SPICE  models  for  the  optoelectronic  side  of 
these  circuits  is  not  only  difficult  but  provides  no  means  to  directly  link  the  physical  effects  inside 
the  devices  with  the  observed  behavior.  The  TCAD  approach  being  developed  in  this  work  takes  a 
major  step  in  the  direction  of  solving  such  problems. 

3.  COMPUTATIONAL  METHODS— Numerical  techniques  and  advanced  algorithmic 
approaches  to  device  modeling  have  been  the  final  area  of  major  achievement  during  this  contract. 
Over  nearly  a  decade,  under  ARO  support,  a  number  of  advanced  algorithms  for  parallel  comput¬ 
ing  have  been  developed  [11].  As  a  result  of  these  key  algorithmic  demonstrations,  both  the  State 
of  California  through  its  Competitive  Technology  (CompTech)  office  and  ARPA's  equipment 
grant  (siting  of  a  32  node  Intel  iPSC/860  at  Stanford)  provided  the  necessary  leverage  to  carry  for¬ 
ward  the  initial  ARO  work  into  full-scale  prototype  demonstrations.  Namely,  the  development  of 
PISCES-MP  (multi-processor  version)  [12]  and  the  benchmarking  of  STRIDE  [13]  on  the  experi¬ 
mental  520  node  “Delta”  machine  sited  at  CalTech  are  key  highlights  of  the  CompTech/ ARPA 
supported  follow-on  work  which  derives  from  the  pioneering  efforts  at  Stanford  that  were  initially 
supported  by  ARO.  Further  details  of  these  efforts  as  well  as  the  most  recent  numerical  results  in 
ac  and  mixed-mode  analysis  are  summarized  in  this  subsection  as  well  as  in  Appendix  3. 
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3a.  Multi-Processor  PISCES:  The  original  demonstrations  of  the  distributed  multi-frontal  (DMF) 
algorithm  developed  by  Lucas  [14]  utilized  matrix  stencils  created  by  PISCES.  Nonetheless,  to 
achieve  the  full-scale  parallelization  of  the  code  many  other  changes  had  to  be  implemented.  The 
Aladdin-CAD  (CompTech)  project  mentioned  above  provided  the  motivation  and  partial  support 
for  this  effort.  Demonstration  of  PISCES-MP  on  the  Intel  iPSC  platform  showed  excellent  speed 
up  factors  up  to  8  processors  but  node  memory  and  interprocessor  communications  limits  of  that 
machine  reduced  the  efficiency  for  further  scaling  [11].  The  follow-on  efforts  under  the  HPC  ini¬ 
tiative  and  supported  by  AREA  have  demonstrated  further  scaling  of  PISCES-MP  on  a  TMC  CM- 
5  Machine  with  32  nodes  [15].  Now  in  progress  is  an  effort  to  move  PISCES-MP  to  the  IBM  SP / 
1  and  SP/2  machines  which  will  lead  to  understanding  of  both  problem  size  and  algorithmic  limits 
of  the  direct  solver  technology  used.  The  sustained  efforts  on  the  PISCES  project,  including  the 
development  of  a  multi-processor  (MP)  version,  provides  a  major  demonstration  with  long  term 
industrial  payoff  potential. 

3b.  Parallel  Iterative  Solvers:  The  interest  in  and  needs  for  fully  3D  TCAD  applications  has 
grown  steadily  over  the  past  decade.  The  previous  ARO  contract  resulted  in  the  demonstration  of 
the  STRIDE  [13]  prototype  3D  device  solver.  Under  CompTech  and  now  ARPA  support  that  pro¬ 
totype  has  been  used  to  generate  3D  bipolar  benchmark  examples  with  4.9  million  grid  that 
achieve  convergent  results  over  the  entire  operating  region  [16].  In  fact,  these  results  were  hon¬ 
ored  in  Intel  Grand  Challenges  Computing  benchmarks  in  May,  1992.  This  work  has  also 
received  national  recognition  at  the  HPC  Symposium  (March  15, 1994)  and  the  results  are  shown 
in  the  HPC  “Blue  Book”  literature  illustrating  applications  where  HPC  technology  is  expected  to 
have  major  payoffs.  Again,  it  should  be  emphasized  that  the  core  DMF  algorithm  and  initial  dem¬ 
onstration  of  STRIDE  came  under  previous  ARO-sponsored  projects.  DMF  solver  technology  as 
well  as  parallelization  of  preconditioners  required  for  iterative  solutions  have  resulted  in  excellent 
overall  efficiency  and  scalability  of  computation  time  with  processors  and  problem  size.  In  addi¬ 
tion  to  the  very  exciting  benchmark  results  discussed  above,  there  has  also  been  more  detailed 
theoretical  work  in  understanding  the  overall  capabilities  of  pre-conditioning  for  block  iterative 
solvers  [17].  This  aspect  of  the  work  is  quite  basic  and  has  been  primarily  supported  by  ARO. 

3c.  Algorithms  for  ac  Analysis:  As  noted  in  the  previous  section  on  use  of  ac  analysis  in  the 
design  of  GaAs  MESFETs,  such  capabilities  are  of  major  value  and  benefit  for  high  frequency 
applications.  From  an  algorithmic  point  of  view,  the  development  of  robust  iterative  techniques 
for  ac  analysis  has  been  a  major  challenge.  First,  previous  methods  that  exploit  direct  matrix  solu¬ 
tion  techniques  become  highly  memory  and  CPU  intensive  [18].  Second,  the  numerical  nature  of 
the  matrix  changes  with  frequency  and  robustness  becomes  a  serious  challenge  at  frequencies 
above  the  device  cut-off.  The  key  features  of  the  new  algorithm  used  in  this  work  is  summarized 
in  Appendix  2  and  further  details  are  presented  in  the  PISCES  2ET  documentation^].  By  means 
of  careful  control  of  the  matrix  conditioning,  based  on  operating  frequency  and  other  information 
about  device  operating  conditions,  excellent  convergence  rates  and  overall  numerical  stability 
have  been  achieved  through  research  supported  in  this  contract. 

3d.  Mixed-Mode  (PISCES-SPICE)  Simulations:  There  has  been  a  number  of  developments  in 
the  area  of  mixed-mode  simulation,  as  well  as  sustained  motivation  of  the  need  for  such  model¬ 
ing.  In  the  previous  section,  two  important  applications  for  mixed-mode  have  been  illustrated. 
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Namely,  both  power  device  and  OEIC  modeling  are  cases  where  the  interactions  of  complex 
device  physics  and  circuit  boundary  conditions  lead  to  requirements  that  mandate  the  use  of 
mixed-mode  techniques. 

During  this  contract,  our  efforts  have  focused  on  creating  a  loosely  coupled  mixed  mode  capabil¬ 
ity  that  utilizes  standard  versions  of  PISCES  and  SPICE  with  modular  changes  to  each  of  the 
respective  codes.  This  has  required  not  only  careful  understanding  of  the  internal  algorithms  used 
at  each  level  but  how  to  interface  and  control  the  coupled  operation  of  the  two  tools.  The  bound¬ 
ary  conditions  and  developing  a  consistent  representation,  especially  on  the  device  analysis  side 
has  been  a  major  challenge.  On  the  circuit  analysis  side,  the  use  of  SPICE  3  has  been  of  major 
benefit  owing  to  the  improved  code  structure  and  documentation.  Finally,  there  has  been  an  effort 
to  parallelize  the  mixed-mode  capabilities  in  the  context  of  PVM,  running  on  both  networked 
workstations  and  the  IBM  SP/1.  The  potential  for  major  speeding  factors  has  been  demonstrated 
based  on  results  for  ring  oscillator  circuits  and  six  transistor  SRAMS. 

The  mixed-mode  aspect  of  this  project  received  AASERT  funding  beyond  the  present  ARO  con¬ 
tract  and  this  research  will  be  continued  in  support  of  OEIC  applications.  In  fact,  the  conclusions 
of  this  report  outline  both  the  major  findings  of  this  project  and  the  key  opportunities  for  further 
research,  including  technology  transfer  to  industry. 

CONCLUSIONS  AND  RECOMMENDATIONS 

The  development  of  physical  models  and  algorithms  for  multi-layer,  multi-material  device  simu¬ 
lation  has  resulted  in  major  accomplishments  in  all  aspects  of  the  project.  Moreover,  there  has 
been  ongoing  as  well  as  new  opportunities  for  technology  transfer  and  industrial  impact.  The 
highlights  of  this  work  include: 

•  Implementation  of  the  2ET  model  in  PISCES  and  demonstration  for  both  advanced  sil¬ 
icon  and  compound  material  heterojunction  devices. 

•  Application  of  the  advanced  PISCES  2ET  modeling  capabilities  to: 

-  GaAs  MESFET  sidegating  with  quantitative  experimental  confirmation, 

—  Laser  diode  modeling  in  support  of  Hewlett-Packard  applications. 

•  Algorithmic  advances  in  support  of  ac  high  frequency  analysis  with  mixed-mode 
boundary  conditions. 

In  addition  to  the  direct  output  of  this  contract,  there  have  been  a  set  of  spin-off  results  from  pre¬ 
vious  ARO  contracts  that  deserve  special  comment.  In  the  area  of  silicon  devices,  there  are  ongo¬ 
ing  developments  and  applications  supported  by  SRC.  For  example,  advanced  MOS  mobility 
model  development  and  characterization  of  ESD  are  two  primary  focus  areas  at  present  under  the 
SRC  contract.  The  work  on  mixed-mode  simulation  has  received  support  from  both  ARO  and 
SRC.  With  the  ending  of  this  contract,  the  underlying  support  associated  with  the  AASERT  Fel¬ 
lowship  for  Francis  Rotella  now  comes  within  the  scope  of  the  SRC  contract  on  PISCES.  Table  1 
gives  a  summary  of  the  various  staff  members  funded  over  this  contract  and  their  employment  sta¬ 
tus,  including  those  still  involved  in  graduate  education. 
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The  ongoing  work  on  parallel  algorithms  and  their  demonstration  in  the  context  of  the  HPC  initia¬ 
tive  now  falls  under  the  support  of  ARPA  contracts.  Over  this  contract  period,  there  have  been 
several  demonstrations  of  major  significance  that  have  drawn  directly  on  efforts  started  under  pre¬ 
vious  ARO  contracts.  Specifically,  the  benchmarking  of  STRIDE  on  the  Caltech  “Delta”  machine 
with  520  processors  demonstrated  simulation  results  for  a  bipolar  transistor  with  nearly  5  million 
grid— probably  the  world's  largest  device  simulation.  The  parallelization  of  PISCES  has  also 
resulted  in  important  benchmark  results  on  both  Intel  (iPSC/860)  and  TMC  (CM/5)  machines 
with  up  to  64  processors.  Ongoing  work  in  this  area  is  in  fact  expected  to  have  a  direct  impact  on 
commercial  application  codes  both  in  the  TCAD  area  as  well  as  other  engineering  applications 
that  use  direct  solvers. 

There  are  several  aspects  of  the  present  work  that  deserve  special  attention  and  are  attractive  for 
follow-on  efforts.  The  application  domain  of  OEIC  device  and  circuit  modeling  is  considered  to 
be  the  prime  candidate  for  further  research  effort.  While  there  have  been  very  promising  results  in 
both  the  transport  modeling  of  the  devices  as  well  as  in  supporting  mixed-mode  simulations  dem¬ 
onstrated  during  this  contract  period,  there  are  a  number  of  major  advances  still  required  in  order 
to  support  full  prototype  applications.  First,  the  quantum  confinement/transport,  including 
detailed  consideration  of  the  role  of  phonons  in  the  equilibration  process,  requires  the  develop¬ 
ment  of  a  physically  consistent  formulation.  Second,  the  complete  linking  of  the  electrical  and 
optical  effects  needs  further  investigation  as  well.  In  support  of  this  system-level  modeling  there 
are  associated  needs  for  better  framework  integration  tools  and  methodology.  Finally,  the  overall 
issue  of  packaging  of  OEICs  offers  a  very  fertile  ground  for  new  efforts.  Since  nearly  half  the  sys¬ 
tem  cost  of  OEICs  comes  from  the  packaging,  there  is  a  real  opportunity  to  impact  both  perfor¬ 
mance  and  cost  issues  for  this  key  new  technology. 

There  are  certainly  other  opportunities  to  further  advance  the  TCAD  support  for  advanced  device 
technologies.  The  area  of  wireless  communications  is  now  of  major  commercial  interest  and  has 
ongoing  impact  on  defense  systems.  The  present  contract  has  resulted  in  key  demonstrations  of 
improved  parasitic  modeling  and  ac  high  frequency  characterization.  The  opportunities  to  further 
develop  mixed-mode  TCAD  for  high  frequency  capabilities  in  support  of  MMIC  applications 
including  both  silicon-  and  GaAs-based  technologies  is  thus  strongly  recommended.  In  fact,  some 
of  the  demonstrations  of  new  physical  modeling  capabilities  for  GaAs  (i.e.  deep  level  trap  model¬ 
ing)  now  open  the  way  to  more  completely  characterize  the  analog/digital  trade-offs  for  this  rela¬ 
tively  mature  but  often  troublesome  technology  base.  On  the  silicon  side,  the  recent  advances  in 
SOI  technology  appears  to  offer  great  promise  for  further  advances  into  the  microwave  operating 
regime.  Yet  the  trade-offs  in  parasitic  effects  and  overall  technology  integration  path  need  to  be 
further  scrutinized.  The  range  of  tools  and  techniques  demonstrated  in  this  contract  indeed  pro¬ 
vide  the  cornerstones  for  further  application  and  development  in  the  area  of  MMIC  analysis  and 
design. 
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Abstract—  An  improved  energy  transport  model  for  device 
simulation  is  derived  from  the  zeroth  and  second  moments  of 
the  Boltzmann  transport  equation  (BTE)  and  from  the  pre¬ 
sumed  functional  form  of  the  even  part  of  the  carrier  distribu¬ 
tion  in  momentum  space.  Energy-band  nonparabolicity  and 
non-Maxwellian  distribution  effects  are  included  to  first  order. 
The  model  is  amenable  to  an  efficient  self-consistent  discretiza¬ 
tion  taking  advantage  of  the  similarity  between  current  and 
energy  flow  equations.  Numerical  results  for  ballistic  diodes  and 
MOSFET's  are  presented.  Typical  spurious  velocity  overshoot 
spikes,  obtained  in  conventional  hydrodynamic  (HD)  simula¬ 
tions  of  ballistic  diodes,  are  virtually  eliminated. 


THE  drift-diffusion  (DD)  model,  which  assumes  a  local 
equilibrium  of  mean  carrier  energy,  is  usually  not  accu¬ 
rate  enough  for  submicrometer  device  modeling,  owing  to 
the  rapidly  changing  electric  fields  (see,  for  example,  [1],  [2] 
and  references  therein).  The  conventional  hydrodynamic  (HD) 
model,  which  also  incorporates  momentum  and  energy  bal¬ 
ance  equations,  and  in  the  more  complete  form  even  includes 
the  convective  term  for  the  carrier  velocity,  is  known  to  yield 
unphysical  solutions  in  some  cases,  e.g.,  the  spurious  veloc¬ 
ity  overshoot  spike  in  n+-n-n+  ballistic  diodes.  Numerical 
stability  is  also  a  difficult  problem,  due  to  the  hyperbolic 
nature  of  the  model  equations.  An  improved  energy  transport 
(ET)  model,  which  overcomes  the  physical  limitations  and 
preserves  the  excellent  numerical  stability  and  efficiency  of 
the  DD  model,  is  presented  in  this  paper  as  an  alternative  to 
more  computationally  expensive  methods,  such  as  the  Monte 
Carlo  (MC)  model. 
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For  electrons,  the  zeroth  and  second-order  moments  of  the 
steady-state  Boltzmann  transport  equation  (BTE)  give  the 
carrier  continuity  and  energy  balance  equations: 

V  J=  -q(G-R)  (1) 

IdE  \ 

V-S  =  F-J-n—  (2) 

\  "t  cou  / 

where  the  current  J  and  the  energy  flow  5  are  defined  by 
J  =  -qn(v)  =  ~qj  d^kvf  and  S  =  n(Ev)  =  /  d'kEvf. 
Here,  G  -  R  stands  for  the  carrier  generation- recombina¬ 
tion  rate,  F  is  the  electric  field,  q  is  the  elementary  charge, 
n  is  the  carrier  concentration,  v  is  the  carrier  velocity,  k  is 
the  crystal  momentum,  /  is  the  carrier  distribution  function, 
and  £  is  the  carrier  energy.  The  angular  brackets  denote 
averages  over  the  distribution  function,  i.e.,  (A)  = 
/  dikAf/  J  dfkf.  The  last  term  in  (2)  is  the  energy  loss 
rate  due  to  collisions.  By  decomposing  /  into  its  even  and 
odd  parts,  /0  and  /,.  and  using  the  microscopic  relaxation 
time  approximation  (d/,  fdt)col]  =  -/,  /r  [3],  the  odd  part 
of  the  BTE  becomes 
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Substitution  of  (3)  into  the  expressions  for  J  and  S  yields 


J  =  -q  J  d3kvft  =  q[niiF  +  V  ■  (nD))  (4) 
S  =  J  d'kEvft  =  -nfrEF  -  V  ■  ( nDE )  (5) 


where  all  the  transport  coefficients  D,  /x£,  and  DE  are 
tensors  and  their  components  are  defined  by 
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DE  = 

(t£v,v,)„. 

(7) 

Here  ()0  denotes  average  over  /„.  The  divergence  of  the 
tensors  in  (4)  and  (5)  means  to  take  the  divergence  of  each 


0741-3106/92S03.00  C  1992  IEEE 


row  of  the  tensor  as  a  vector  element.  It  is  worth  noticing 
that  the  constraint  /,  <  /„,  which  is  assumed  in  the  Legen¬ 
dre  polynomial  expansion  approach  in  [3]  and  the  perturbed 
distribution  approach  in  [4],  is  not  necessary  in  this  model. 

For  materials  like  Si,  where  the  mean  kinetic  energy 
m{  vf  /2  is  much  smaller  than  the  mean  thermal  energy 
m{v2)/ 2.  the  correction  from  the  field-induced  anisotropy  is 
generally  not  important  [5].  Therefore.  /0  can  be  approxi¬ 
mated  by  an  isotropic  distribution  /0(  £)  and  £,  E>,  {iE .  and 
DE  become  scalar  functions  of  the  mean  carrier  energy  (£). 
We  assume  in  the  following  t  =  r(  E)  oc  E~p .  where  p  is 
determined  by  the  dominant  scattering  mechanism  and  ranges 
from  0.5  to  1  according  to  our  preliminary  bulk  MC  simula¬ 
tions.  Additional  hot-carrier  effects  are  included  by  incorpo¬ 
rating  the  following  first-order  approximations:  a)  non¬ 
parabolic  bands.  h2k2 /2m*  =  £(1  +  aE):  and  b)  non- 
Maxweliian  effects:  /0  =  (1  +  y£ / k BTe)fm(E).  Here. 
fm(  E)  is  the  Maxwellian  distribution  at  an  elevated  tempera¬ 
ture  Te  and  7  is  the  non-Maxwellian  parameter.  With  these 
assumptions  and  from  (6)  and  (7),  the  modified  Einstein 
relations  are  valid  not  only  for  n  and  D  but  also  for  nE  and 
De,  i.e.,  D  =  kBTmn/q  and  DE  =  kBTmnE/q.  where  the 
equivalent  carrier  temperature  Tm  =  (1  +  7 )Te  is  related  to 
the  mean  carrier  energy  by 


*/=  T“  Nj2).  (12) 

‘■'u 

LtJ  is  the  distance  between  nodes  i  and  j,  B(x )  =  x/exp 
( x )  -  1)  is  the  Bernoulli  function,  and  utJ  =  -  \p,)/T, 

with  T  =  /*/  T  dx/Ljj.  Usually,  the  carrier  temperature 
varies  much  more  slowly  than  the  potential  does.  Therefore, 
the  linear  approximation  T  =  (f,  +  7) ) ,2  is  reasonable  in 
practice.  This  also  implies  a  quadratic  approximation  in  space 
for  the  potential  according  to  the  above  discretization  as¬ 
sumptions. 

Since  £=  -V^,  by  substituting  F  ■  J  with  -V  •  (\/J) 
+  \J/V  ■  J  and  employing  a  macroscopic  relaxation  time  ap¬ 
proximation  for  ((3£ / dt)coll).  we  can  rewrite  (2)  as 

(£)  -  £0 

V  •  H  =  *  •  (C  -  R)  -  n - -  (13) 

T, 

where  £0  =  3kBT0/2,  with  7"0  being  the  lattice  tempera¬ 
ture.  rf  is  the  energy  relaxation  time,  and  H  =  S  +  4*J  is 
the  total  energy  flow,  consisting  of  both  the  thermal  energy 
flow  S  and  the  electrical  flow  i pJ.  From  (11)  and  (12).  and 
using  \p  =  (yp,  +  ^  )/2.  we  can  express  the  total  energy  flow 
on  the  mesh  line  as 


(£)=  (1  +  7  +  ^«**t;)  3-kBTe 

=  (l  +  5-*kBTmy-kBTm.  (8) 

Then,  we  can  recast  (4)  and  (5)  into 

J  =  q(-nnV4<  +  kBV(niiTm))  (9) 

-nnkBTmVt+  ^  V(«^)J  (10) 

where  \p  is  the  electrical  potential  and  Ce  =  ^E /(kBTmn)  = 
(5/2  -  p)(  1  -  akgTm/2)  is  the  energy  flow  factor  Ce.  Ce 
is  in  general  a  smoothly  varying  function  of  the  mean  carrier 
energy  and  for  nearly  parabolic  bands  Ce  =  5/2  -  p.  It  is 
important  to  notice  that  in  our  approach  the  Wiedmann- Franz 
law  for  heat  flow  is  never  invoked,  as  it  is  usually  necessary 
in  the  conventional  HD  method  to  obtain  a  closure  of  the 
moment  equations. 

Based  on  the  similarity  of  the  functional  expressions  in  (9) 
and  (10),  an  extended  Scharfetter-Gummel  method  [6]  was 
developed  for  self-consistent  discretization  of  both  the  carrier 
continuity  equation  and  the  energy  balance  equation.  By 
assuming  that  along  each  mesh  line  J,  S,  and  (1  /  T)(d\p  / dl) 
are  constant,  we  can  solve  1-D  forms  of  both  (9)  and  (10)  on 
the  mesh  lines  and  perform  the  standard  box  integration 
around  each  node  [7],  With  the  following  normalized  nota¬ 
tion:  J  =  J/q.  S  =  -S/qCe,  N  -  nyi,  f  -  kgTm/q.  (9) 
and  (10)  are  evaluated  on  the  mesh  line  connecting  the  nodes 
i  and  j  by 

j,  =  —  (BiujNjj-  (11) 


q(tl-  CeS,) 

=  (B{ujJ)NJfJ(\p  -  CeTj) 

‘-•j 

-Bi-u^N'f'd-Cj,)).  (14) 


In  this  improved  ET  model,  the  necessary  transport  parame¬ 
ters  Ce(Tm).  and  t,(T„)  can  all  be  determined,  for 

instance,  from  bulk  MC  simulations  by  evaluating  the  follow¬ 
ing  ensemble  averages: 


(v)  ^  3  (£■>)  (£)  -  £p 

T'  e  =  2  (E)(v)  'T,=  ~  qF(v) 


(15) 


We  have  tested  the  model  by  simulating  typical  Si  n*-n-n* 
ballistic  diodes  and  MOSFET's.  For  simplicity,  we  assume 
n(Tm)  =  n0T0/Tm,  Ce  =  3/2.  and  r,  =  0.4  ps.  Also,  we 
have  used  the  modified  Caughey -Thomas  model  [7]  for  the 
low-field  mobility  n0  and  the  surface  scattering  model  given 
in  [8].  A  full  Newton  method  is  employed  and  quadratic 
convergence  can  be  reached  within  7-8  iterations  in  all 
cases,  from  a  good  initial  guess.  A  comparison  of  results  for 
the  drift  velocity  in  an  n+-n-n+  structure,  using  different 
models,  is  given  in  Fig.  1.  It  is  clear  that  the  ET  model  can 
describe  velocity  overshoot  with  reasonable  accuracy,  when 
compared  to  the  Monte  Carlo  result.  The  spurious  velocity 
overshoot  spike  at  the  anode  junction,  which  appears  in 
solutions  with  the  HD  models  [9],  is  virtually  absent  in  our 
results.  We  believe  that  the  momentum  relaxation  time  ap¬ 
proximation.  used  in  the  formulation  of  the  HD  model  but 
not  invoked  by  the  ET  model,  is  responsible  for  the  spurious 
velocity  peak.  For  a  detailed  explanation,  the  reader  is 
referred  to  a  forthcoming  publication  (10). 


length  [  fim  j 

Fig.  1.  Drift  velocity  profiles  for  a  Si  n'f-n-n*  structure,  with  abrupt 
junctions.  n*  =  5  x  1017  cm*3,  and  n  =  2  x  1015  cm'3.  The  length  of  the 
n-region  is  0.4  pm  and  bias  applied  to  the  collector  n*  region  on  the  right  is 
1.5  V.  Results  are  obtained  using  the  energy  transport  (solid  line),  drift-dif¬ 
fusion  (dotted  line),  hydrodynamic  (dot-dashed  line)  and  Monte  Carlo  (dashed 
line)  models. 


length  (  pm  ) 

Fig.  2.  Electron  temperature  (Tm)  contours  obtained  for  a  typical  0.4-pm 
implanted  channel  n-MOSFET.  solved  with  the  ET  model  on  a  40  x  40 
uniform  rectangular  grid.  Maximum  doping  in  the  implanted  channel  is 
2  x  I017  cm'  .  The  applied  biases  are  Vcs  -  2.0  V.  =  0.0  V.  and 
VDS  =  2.0  V.  The  step  between  temperature  contour  lines  is  200  K.  The 
bold  dotted  lines  indicate  the  position  of  the  source  (left)  and  drain  (right) 
junctions. 

To  further  test  the  new  model  and  the  new  discretization 
scheme  in  2-D,  a  typical  0.4-pm  implanted  channel  n- 
MOSFET  has  been  simulated.  The  model  was  implemented 
in  the  PISCES  II  simulation  code.  For  simplicity,  although 
this  does  not  reflect  a  limitation  of  the  approach,  the  dis¬ 
cretization  nodes  are  distributed  on  a  uniform  rectangular 


pattern.  In  Fig.  2  we  show  the  temperature  contours  for  a 
40  x  40  grid.  The  introduction  of  surface  scattering  in  the 
model  did  not  appear  to  alter  the  convergence  properties  of 
the  scheme.  The  stability  of  the  solution  was  still  excellent 
when  we  simulated  the  same  structure  on  a  rather  coarse 
10  x  10  grid.  This  is  a  good  indication  that  the  discretization 
scheme  is  well-posed.  Our  results  are  very  encouraging  and 
show  that  this  ET  model  should  be  suitable  for  the  simulation 
of  general  2-D  structures. 
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Abstract  Dual  energy  transport  (DUET)  model  in  semiconductor  devices  including  heterostructures  has 
been  developed  to  simulate  the  distribution  of  carrier  and  lattice  temperatures  in  addition  to  profiles  of 
the  electrostatic  potential  and  carrier  concentrations.  The  modeling  approach  is  in  consistency  with  the 
conventional  drift-diffusion  (DD)  model,  making  it  easy  to  implement  in  the  existing  code.  Carrier  energy 
dependent  mobility  and  impact  ionization  models  have  been  examined  and  are  used  for  simulation  of  various 
velocity  overshoot  and  hot  electron  effects.  Two  simulation  examples,  one  for  submicron  MOSFET  and  one 
for  deep-submicron  SOI,  are  presented  through  comparison  with  the  measurement  data  to  demonstrate  the 
improvement  of  the  new  model  over  DD  model  in  predicting  the  device  characteristics  for  modern  (submicron) 
structures. 

DUET  Transport  Model  DUET  model  uses  six  state  variables  -  potential  (V>),  electron  and  hole  con- 
centrations  and  temperatures  (n,  p,  Tn,  and  Tp),  and  lattice  temperature  (Tt)  to  describe  the  status  of 
a  semiconductor  device.  Except  of  Shockley  semiconductor  equations  for  ip,  n,  and  p  and  thermal  diffu¬ 
sion  equation  for  Ti ,  the  governing  equations  for  carrier  temperatures  are  derived  from  the  energy  balance 
principle  which  in  the  form  of  the  Fick’s  second  law  can  be  written  as  follows: 

dw„ 

Qt  -V  '  Sn  +  jn  '  Sri  Utim  ( 1 ) 

where  electrons  are  used  as  the  example  and  wn  is  the  electron  kinetic  energy  density,  s„  is  the  energy 
flux,  £n  is  the  electric  field  acting  on  electrons,  and  u,„n  is  the  net  energy  loss  rate.  The  other  symbols 
have  conventional  meanings.  The  key  issues  in  the  model  development  are  to  obtain  the  expressions  for 
carrier/energy  fluxes  in  terms  of  the  state  variables  and  to  identify  the  mechanisms  for  carrier  energy  gain 
and  loss.  In  DUET  the  distribution  function  in  the  real  and  crystal  momentum  space  is  constructed  from  the 
Fermi-Dirac  statistics  for  the  quasi-thermal  equilibrium  using  the  relaxation  time  approximation  [1],  so  all 
the  transport  coefficients  and  flux  expressions  can  rigorously  be  derived.  For  example,  the  electron  current 
density  and  energy  flux  are  computed  from  the  following  expressions: 

j„  =  npnVEFn  +  qnfinQ„VTn  (2) 

s«  =  ~Pn1 njn  Kn^7Tn  (3) 

where  Q,  P,  and  k  are  thermopower,  thermoelectric  power,  and  thermal  conductivity,  respectively,  and 
all  of  which  can  be  related  to  the  mobility  p.  The  energy  exchange  mechanisms  considered  in  the  model 
include  SRH,  Auger,  and  radiative  recombinations,  and  impact  ionization.  Together  with  proper  boundary- 
conditions  and  carrier  energy  dependent  models  for  mobility  and  impact  ionization,  the  model  has  been 

imp  emented  in  PISCES  (version  2Et)  [2]  and  the  initial  test  of  the  code  showed  very  encouracine  results  as 

presented  below. 

Simulation  Examples  The  following  two  examples  show  that  DD  model  is  no  longer  accurate  in  pre¬ 
dicting  I-  V  characteristics  for  submicron  devices  when  the  non-station  ary  phenomena  such  as  the  velocity- 
overshoot  and  nonlocal  field  dependence  of  physical  parameters  such  as  the  impact  ionization  rate  become 
important.  While  both  DD  and  DUET  models  provide  good  simulation  results  compared  to  the  measure¬ 
ment  for  devices  with  relatively  long  channel  length,  DD  model  starts  to  break  for  output  characteristics  of 

k  jnnw  ~  012#i/Flg'  *)  4,1(1  substrate  current  of  MOSFET  at  LtJ)  =  0.8  ji  (Fig.  2).  On  the  other 
hand  DUET  can  consistently  model  the  device  characteristics  well  even  when  the  device  size  is  scaled  down 
to  the  deep  submicron  range.  Although  we  are  now  still  calibrating  the  model  parameters  for  DUET  in 
order  to  improve  the  simulation  accuracy,  the  examples  presented  clearly  indicate  that  the  DUET  modelinc 
approach  is  justified. 

Appendix  Complete  set  of  DUET  equations  and  expressions  for  the  energy  dependent  mobility  and  impact 
ionization  rate. 
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Appendix 


Poisson’s  equation 

V-(-tVrf,)  =  q(p-n  +  N+-N;) 
Carrier  continuity  equations: 

dn  1 

a  = 

ap  i_  . 

_  = 

Energy  balance  equations  for  carriers: 

dwn 
dt 
dw 


—  ^  *  Sn  Jn  '  Uwn 


UWp 

Qt  —  —  v  •  Sp  +  jp  •  £p  — ■  uWp 


(1) 

(2) 

(3) 


where 


Uu m  —  (u»rh  "+■  Uraj)  — “ 


(un,.Auper  —  Sn,imp )  j^£j(2x,)  +  j^-B^pj 

3l  ,  wn(T„)  —  UI„(T.l) 

~Sp,tmpX*BJn  +  - 

^  Twin 


i^u/p  =  (Ujrfc  +  Urod)  2  ^B^p  “ 


(up,  Augtr  ~  9p ,  imp) 


i(r£)  +  |*Brn] 


-gn.imp^BTp  + 


*  u/p 


Thermal  diffusion  equation  for  lattice: 

CL^r  =  v  •  (*£vr^) 


+u,rh  2^®^"  +  Es(Ti)  +^A:bTp| 

,  ^n(Tn)  -  |  Uy(Tp)  -  Uy^) 


'wn 


'u/p 


Mobility  (Haensch’s)  and  impact  ionization  rate  (Slot- 
boom)  models: 


M N,Tl,E±,T' )  = 


7(N,TL,E±)  = 

a  =  Aexp[-(6/^<//n 

r  3  kBTe-TL 

l*U  =  o - 

2  9  Twvla( 


Figure  1:  Simulation  results  for  the  substrate  current  in 
MOSFET  with  two  different  channel  length  (2  and  0.8 
fim.)  and  the  comparison  is  made  for  0.8  nm  case  between 
the  ET-simulated  and  measured  results  (from  MIT  and 
UC  Berkeley,  respectively).  The  upper  curves  are  simu¬ 
lated  using  DD  model  while  the  lower  curves  are  obtained 
from  ET  simulation. 


Figure  2:  Simulated  and  measured  data  for  SOI  structure 
with  different  channel  length. 
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The  carrier  transport  mechanism  in  GaAs  which  includes  a  negative  different  mobility 
in  the  carrier  velocity-field  curve  when  implemented  in  the  2D  device  simulator,  causes 
grid-dependent  numerical  instability  and  poor  convergence  behavior.  It  has  well  been 
established  in  the  literature  [1,  2]  that  the  conventional  drift-diffusion  (DD)  approximation 
which  assumes  a  local  thermal  equilibrium  among  the  same  type  of  carriers  (electrons  or 
holes)  has  significant  shortcomings  in  the  description  of  GaAs  device  operation.  In  this 
paper  we  report  an  extension  of  the  energy  transport  (ET)  model  [3,  4]  in  which  carrier 
energy  conservation  equations  are  solved  to  GaAs  device  simulation.  A  technique  whereby 
the  empirical  form  of  the  field-dependent  mobility  is  transformed  to  an  energy-dependent 
one  used  in  the  ET  model  is  presented. 

The  empirical  formula  describing  the  dependence  of  mobility  on  electric  field  imple¬ 
mented  in  the  2D  device  simulator  PISCES  [6]  has  the  form  fi{Nr,  E)  =  , 

1  '  '  E0  ' 

where  fia  is  the  low-field  mobility,  E  is  the  magnitude  of  the  electric  field,  Ea  is  the  critical 
electric  field,  and  vaat  is  the  saturation  drift  velocity.  In  applying  the  ET  model  to  GaAs, 
a  formulation  describing  the  dependence  of  carrier  velocity  on  local  temperature  needs 
to  be  determined.  From  the  second-order  moment  of  the  Boltzmann  transport  equation, 
the  energy  conservation  equation  at  steady  state  is  given  byV-S  =  E-  J  —  n(^  ). 

The  macroscopic  quantities  are:  energy  flux  S,  electric  field  vector  E,  electron  current 
density  J,  electron  concentration  n,  and  the  average  electron  energy  loss  due  to  collision 
( dE/dt\cou ).  Since  the  empirical  formula  for  //(E)  were  obtained  experimentally  under 
uniform  field  condition  where  V  •  S  =  0  and  J  =  qnn(E)E,  by  using  the  assumptions 
in  that  the  average  energy  loss  term  (  we  arrive  at  a  transcendental 

relation  of  local  field  and  carrier  energy, 


l*(E)Ez  =  - 


3  k(Te  -  Tl) 


qru 


(1) 


Here,  k  is  the  Boltzmann  constant,  denotes  the  energy  relaxation  time,  and  Te  and 
Tl  are  the  electron  and  lattice  temperatures  respectively.  Hence,  the  carrier  temperature 
can  be  solved  for  a  given  field  strength,  and  the  dependence  of  mobility  on  local  carrier 
energy  can  be  obtained. 

During  actual  implementation  in  the  device  simulator  PISCES,  it  is  more  convenient 
to  have  an  analytical  form  in  which  the  mobility  dependence  on  local  carrier  temperature 
can  be  solved  directly.  At  low  fields  we  assume  J  =  qn(i0E ,  hence 


E  =  and  /*( T '.) 


i+[ 


°  2  QUoTu,  E%  v  Tl 


2  quoTw  Eq  tl  j 


(2) 


°|Monte-Carlo  Simulation  results  reported  in  Ref.  [5]  were  used  to  extract  the  energy  relaxation  time, 
rw  =  0.8  x  10“ 12  second. 
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(3) 


At  high  fields  we  make  the  assumption  that  J  =  qnvtai ,  hence 

IT -3  *77,  1/2 1  1V  i  (T  7  _  'JZlBotTr-'W 

The  mobility  relations  can  thus  be  formulated  approximately  as  a  weighted  average  of 


fi(NT,  Te)  =  0.5  •  (1  -  F)  ■  (iL(NT,  Te)  +  0.5  •  (1  +  T)  ■  Te)  ,  (4) 

where  T  =  tanh[o(Te  —  Tc)\  is  a  fitting  function  in  which  a  and  Tc  are  fitting  parameters. 
The  average  carrier  velocities  as  a  function  of  carrier  temperature,  v  —  n(Te)E(Te),  by 
using  the  transcendental  relation  of  Eq.  (1)  and  fitting  function  of  Eq.  (4),  are  shown  in 
Fig.  1. 

We  simulated  a  non-homogeneous  GaAs  n+  —  n  —  n+  diode  featuring  a  5-/xm  n-region 
doped  to  1  x  1017  cm-3  and  O.T/im  n+-region  doped  5  x  1019  cm"3  at  two  ends.  The 
corresponding  current-voltage  characteristics  using  different  grids  of  uniform  spacing  are 
shown  in  Fig.  2.  The  DD  model  which  uses  the  local  field  formulation  tends  to  converge 
very  slowly.  It  can  be  seen  that  near  electric  field  strengths  at  which  intervalley  carrier 
transfer  occurs  and  Gunn-domain  forms,  simulated  currents  display  non-physical  'zig¬ 
zags’,  and  their  periodicity  is  extremely  grid  sensitive.  Simulated  net  charge  distribution 
for  a  l-fim  GaAs  n+  —  n  —  n+  diode  at  14* =1.0  volt  is  shown  in  Fig.  3. 

We  also  simulated  a  GaAs  MESFET  structure  featuring  a  1  /im  gate  and  0.8  /mi 
spacers.  The  drain-source  current-voltage  relationship  is  compared  using  both  the  DD 
and  ET  model  in  Fig.  4.  The  DD  model  exhibits  similar  non-physical  ’zig-zags’  in  the 
drain  current  -  an  indication  that  the  present  grid  density  is  insufficient  to  resolve  the 
Gunn-domain.  On  the  other  hand,  the  ET  model  yields  numerically  stable  solutions  with 
excellent  convergency  rate  without  incurring  an  additional  computation  cost. 

In  summary,  the  energy  transport  formulation  that  accounts  for  the  principle  non¬ 
stationary  effect  has  been  applied  to  GaAs  MESFET  device  simulation.  The  excellent 
convergence  behavior  of  the  ET  model  as  demonstrated  in  the  GaAs-based  devices  permits 
a  wide  application  to  III-V  material  based  devices. 
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Figures 


Fig.  1:  GaAs  average  carrier  drift  velocity 
as  a  function  of  electron  temperature  for 
different  low-field  mobility  values. 


Fig.  3:  Simulated  net  charge  distribution 
along  a  1  diode  using  a  200x3  uniform 
grid.  - ET  and - DD  model. 


Fig.  2:  Current-voltage  characteristics  of 
a  5-/xm  GaAs  n+  —  n  —  n+  diode  using 
different  grids. 


Fig.  4:  Comparison  of  simulated  drain 

current  of  GaAs  MESFET  using - ET 

model  and - DD  model. 


Accurate  Modeling  of  GaAs  MESFET  Sidegating  Effects 

By  Trapping  Simulation 


Yi  Liu,  Zhiping  Yu,  Robert  W.  Dutton,  and  Michael  D.  Deal 
AEL  204,  IC  Lab,  Stanford  University,  CA  94305.  Phone:  (415)  725-3644 
Quantum  Electronics  and  Compound  Semiconductor  Devices 

Abstract  GaAs  MESFETs  play  an  increasingly  critical  role  in  digital  mobile  communications  owing  to 
their  low  power  dissipation  [1].  However,  sidegating  effects  which  are  unique  for  devices  built  on  semi- 
insulating  (SI)  GaAs  substrates  cause  cross-talk  of  neighboring  MESFETs,  thus  thwarting  increased  IC 
density.  In  this  paper,  a  detailed  analysis  of  carrier  trapping  processes  in  the  substrate  is  conducted  using 
an  efficient  numerical  trapping  model.  Two  major  mechanisms  are  revealed  to  be  responsible  for  sidegating 
phenomena.  One  is  the  formation  of  a  Gunn  domain  at  the  channel/substrate  interface.  The  other  is  the 
substrate  conduction  between  a  parasitic  Schottky  contact  and  sidegate  due  to  carrier  injection.  For  the  first 
time,  quantitative  agreement  between  the  measured  and  simulated  device  characteristics  is  obtained.  Among 
key  findings  of  this  study  is  the  confirmation  of  correct  electron  mobility  model  used  in  FET  simulation. 
Design  issues  for  improving  circuit  performance  will  also  be  discussed. 

Test  Structures  and  Measured  Sidegating  Characteristics  The  fabricated  test  structure  for 
studying  sidegating  effects  is  schematically  shown  in  Figure  1.  The  purpose  of  using  an  ungated  device  and 
a  separate  Schottky  contact  is  to  identify  and  isolate  the  effects  of  the  parasitic  Schottky  contact.  With  the 
Schottky  contact  left  floating,  the  measured  I-V  characteristics  of  the  drain  current  vs.  sidegate  bias  exhibit 
a  distinct  threshold  behavior  (Figure  2(a)).  When  the  Schottky  contact  is  grounded,  the  drain  current 
reduction  due  to  the  sidegating  effect  is  much  more  severe  (Figure  2(b)). 

Simulation  and  Explanation  In  order  to  analyze  the  carrier  trapping  effect  on  the  device  characteris¬ 
tics,  a  two-level  numerical  trapping  model  is  implemented  in  device  simulator  PISCES  [2],  The  flowchart  for 
this  model  is  outlined  in  Figure  3.  For  dc  analysis,  a  deep-impurity  model  which  assumes  a  single  quasi-Fermi 
level  for  each  type  of  carrier  in  both  band  and  trap  level(s)  is  sufficient,  while  trap  rate  equation(s)  must  be 
included  in  basic  semiconductor  equations  for  ac  and  transient  analyses.  The  trapping  model  is  implemented 
in  such  a  way  that  the  number  of  independent  variables  is  not  increased  in  the  solution  process,  thus  keeping 
the  memory  requirements  and  code  modifications  to  a  minimum.  Figure  4(a)  shows  the  vertical  structure 
for  simulation.  The  simulated  drain  current  decreases  with  the  increasing  negative  sidegate  voltage  with 
a  threshold  behavior  only  when  the  negative-differential-mobility  (NDM)  model  is  used  in  the  simulation 
(Figure  5).  The  simulation  shows  that  a  negative  space  charge  region  appears  on  the  substrate  side  at 
the  channel/substrate  interface  when  the  electric  field  in  the  substrate  reaches  a  critical  value  (Figure  6). 
explaining  the  threshold  behavior  in  Figure  2(a).  The  formation  of  the  Gunn  domain  because  of  the  NDM 
has  previously  been  studied  analytically  by  Boer  for  Gunn  diodes  in  [3].  This  negative  space  charge  region 
is  due  to  the  enhanced  trapping  of  electrons  by  deep-level  electron-traps  because  of  the  electron  accumu¬ 
lation  in  the  Gunn  domain  on  the  substrate  side.  Electrons  are  depleted  on  the  channel  side,  leading  to  a 
decrease  in  the  drain  current.  The  simulated  drain  current  vs.  the  sidegate  voltage  using  more  realistic  lat¬ 
eral  structure  in  Figure  4(b)  shows  an  excellent  agreement  with  the  measured  data  (Figure  2(a)).  Figure  7 
shows  the  simulation  structures  that  include  Schottky  contacts.  The  charge  distribution  simulated  using 
the  vertical  structure  in  Figure  7(a)  shows  that  negative  space  charge  is  created  in  substrate  region  at  the 
channel/substrate  interface  (Figure  8),  which  again  is  due  to  the  enhanced  trapping  of  electrons.  But  in 
this  case,  the  electron  trapping  is  a  consequence  of  the  process  of  electron- injection  into  the  substrate  from 
the  sidegate  and  hole-injection  from  the  Schottky  contact:  the  electron  concentration  in  the  region  at  the 
channel/substrate  interface  is  increased  due  to  the  electron-injection.  The  drain  current  reduction  in  this 
case  is  more  severe  because  the  on-set  of  the  conduction  in  the  Schottky-substrate-sidegate  structure  starts 
at  a  lower  sidegate  voltage  value  than  that  for  the  Gunn  domain  to  form.  The  simulated  drain  current  using 
the  lateral  structure  in  Figure  7(b)  shows  a  good  agreement  with  the  measured  data  (Figure  2(b)). 

Discussion  of  Design  Issues  Based  on  results  from  both  experiments  and  simulations,  direct  Schottky 
contact  on  substrate  should  be  avoided.  Methods  are  being  developed  to  reduce  or  eliminate  the  contact 
on  the  substrate  in  the  overlapping  portion  of  the  gate  contact  of  the  MESFET.  The  supply  voltage  and 
spacing  between  devices  must  be  optimized  to  reduce  the  chance  of  the  Gunn  domain  formation. 
References 

[1]  K.  Miyatsuji,  ti  al. :  ISSCC  Digest,  34  (1994) 

[2]  Z.Yu,  D.Chen,  L.So,  R.W. Dutton,  “PISCES-2ET  document,”  Tech.  Rep.,  Stanford  University  (1994) 

[3]  K.W.Boer,  G.Dohler,  Phy.  Rev.,  186,  793  (15  Oct.  1969) 


1 


ID/Id(Vsq=0)  Id/'d(Vsg=0) 


Figure  1.  Test  Structure  for  measunnents. 
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Figure  2.  Measured  and  simulated  sidegating 
effects  due  Gunn  domian  (a)  and  Schottky  con¬ 
tact  (b). 
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Figure  3.  Flowchart  for  the  trap  model 
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Figure  4.  Vertical  (a)  and  lateral  (b)  structures  for  simulations. 
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Figure  5.  Simulation  results  for  the  vertical  struc¬ 
ture  in  Figure  4(a)  using  different  mobility  models 
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Figure  7.  Vertical  (a)  and  lateral  (b)  structures  for  simu¬ 
lations  that  include  effects  of  the  Schottky  contact 
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Figure  6.  Charge  (a)  and  field  (b)  distributions  simu¬ 
lated  using  the  vertical  structure  in  Figure  4(a). 
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Figure  8.  Charge  distribution  simulated  using  the  ver- 
itcal  structure  in  Figure  7(a). 
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Robust  and  Efficient  ac  Analysis  of  High-speed  Devices  [8] 

Ke-Chih  Wu  Zhiping  Yu  Lydia  So  Robert  W.  Dutton  Junko  Sato-Iwanaga* 
Center  for  Integrated  Systems,  Stanford  University,  CA  94305 


abstract 

^  robust  iterative  algorithm  for  high  frequency  small- 
sjgnal  ac  analysis  is  presented  which  combines  a  novel  ma¬ 
trix  transformation  scheme  and  a  memory-efficient  itera¬ 
tive  method  -  TFQMR  (Transpose- Free  Quasi-Minimal 
Residual).  Excellent  convergence  behavior  is  demon¬ 
strated  for  frequencies  up  to  1015  Hz  without  requiring 
increase  in  array  size  used  in  dc  analysis.  Device  simulator 
(PISCES)  using  this  new  algorithm  extends  the  frequency 
range  of  ac  analysis  far  beyond  present  measurement  ca¬ 
pabilities,  enabling  us  to  investigate  the  device  behavior  at 
extremely  high  frequencies  with  reasonable  computation 
time  and  memory  requirement.  The  results  shown  here 
are  for  GaAs  MESFETs  but  the  techniques  are  applied  to 
bipolar  and  other  high-speed  devices  as  well. 

Introduction 

Small-signal  ac  analysis  is  ac  important  aspect  of  de¬ 
vice  simulation,  not  only  for  the  determination  of  the  fre¬ 
quency  response  of  devices  but  also  for  the  extraction  of 
device  parameters  such  as  the  cutoff  frequency,  fo,  and 
S— parameters.  However,  the  conventional  SOR  (Succes¬ 
sive  Over- Relaxation)  algorithm  used  in  many  device  sim¬ 
ulators  fails  to  converge  when  the  frequency  approaches 
Jt  [1],  and  algorithms  which  eventually  use  the  direct 
solution  method  at  high  frequencies  [3]  require  substan¬ 
tial  memory  space  that  must  be  allocated  even  for  dc 
or  low-frequency  ac  analysis.  In  this  paper,  we  demon¬ 
strate  a  new  algorithm  which  combines  a  novel  matrix 
transformation  scheme  and  a  memory-efficient  iterative 
method  -  TFQMR  (Transpose- Free  Quasi-Minimal  Resid¬ 
ual)  [5]  -  which  shows  excellent  convergence  behavior  (up 
to  1015HZ  in  our  test  examples)  without  requiring  in¬ 
creases  in  the  array  size  used  in  dc  analysis.  Device  simu¬ 
lator  (PISCES)  using  this  new  algorithm  extends  the  fre¬ 
quency  range  of  ac  analysis  far  beyond  present  measure¬ 
ment  capabilities,  enabling  us  to  investigate  the  device 
behavior  at  extremely  high  frequencies  with  reasonable 
computation  time  and  memory  requirement.  The  results 
shown  here  are  for  GaAs  MESFETs  but  the  techniques 
can  be  applied  to  bipolar  and  other  high-speed  devices  as 
well.  _ 
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The  small-signal  ac  simulation  involves  the  solution  of  the 
following  linear  system  for  x  =  xr  +  jxj  where  j  —  \/-T: 

where  J  is  the  Jacobian  matrix  from  the  dc  solution,  u 
is  the  angular  frequency  under  analysis,  D  is  a  diago¬ 
nal  matrix  with  zero  entries  for  the  Poisson’s  equation 
and  B  is  the  stimulus  in  real  number  which  has  non-zero 
entries  only  for  the  Poisson’s  equation.  Thus  DB  =  0, 
the  property  of  which  is  to  be  used  in  Eq.  (3).  When 
u>  increases,  the  off-diagonal  blocks  uD  becomes  increas¬ 
ingly  important,  eventually  leading  to  the  divergence  in 
the  SOR  method. 

However,  considering  A0  as  consisting  of  2  by  2  matrix 
blocks,  the  presence  of  off-diagonal  blocks  with  opposite 
signs  actually  enhances  the  determinant  of  the  whole  ma¬ 
trix.  This  observation  suggests  that  the  difficulties  in  ob¬ 
taining  convergent  solutions  may  be  overcome  by  proper 
matrix  transformation  [2].  Indeed,  if  a  transformation 
matrix  as  indicated  below 


/  I  wDDj1  \ 
\  -wDDj1  1  ) 


(2) 


is  introduced,  where  /  is  the  identity  matrix  and  Dj  is 
the  diagonal  of  J,  Eq.  (1)  becomes:  Arx  = 


/  J  +  u7DDjl D  u>DDj  1  Jo  \  /  xr  \  /  B  \ 

V  -uDDj'Jo  J  +  W2DDJ'D  )  \  *'  /Vo  ) 

(3) 

where  Jo  =  J  —  Dj.  It  is  immediately  clear  that  although 
the  magnitude  of  off-diagonal  blocks  still  increases  pro¬ 
portionally  with  w,  that  of  the  diagonal  entries  are  also 
enhanced  by  terms  proportional  to  w2,  avoiding  the  prob¬ 
lem  of  the  diagonal  blocks  being  overwhelmed  by  the  off- 
diagonals.  There  are  three  distinctive  ranges  of  frequen¬ 
cies  where  the  transformed  matrix  shows  different  charac¬ 
teristics.  At  low  frequencies,  which  can  be  quantified  by 
all  the  non-zero  entries  in  u>DDjl  being  much  less  than 
unity,  off-diagonal  blocks  can  be  considered  as  small  per¬ 
turbations  with  respect  to  the  diagonal  blocks  and  itera¬ 
tive  algorithms  should  do  well  with  or  without  the  trans¬ 
formation.  At  very  high  frequencies,  where  all  the  non¬ 
zero  entries  of  uDDJ1  are  much  greater  than  unity,  the 
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enhancement  to  the  diagonal  blocks  make  the  entire  ma¬ 
trix  become  increasingly  diagonally  dominant  which  im¬ 
plies  improving  performance  for  the  iterative  algorithms. 
At  the  intermediate  frequencies  between  these  extremes, 
however,  the  situation  is  not  as  clearly  cut  since  the  off- 
diagonal  blocks  can  no  longer  be  considered  as  small  per¬ 
turbations  while  the  growth  in  the  enhancements  to  the 
diagonal  blocks  has  not  caught  up  with  the  growth  in  the 
off-diagonals.  The  practical  implication  of  these  charac¬ 
teristics  will  be  examined  in  more  detail  with  simulation 
results.  The  above  discussion  suggests  that  in  interative 
algorithms  employing  preconditioner,  M,  use  of 

/  J  +  u2DDjlD  0 

Mt  =  , 

V  0  J  4-  u2DDj1D 

at  the  intermediate  and  extremely  high  frequencies  gives 
better  convergence  properties  than  does  the  use  of 

Mi :)  <5) 

As  an  iterative  algorithm,  the  SOR  method  has  the  ad¬ 
vantage  of  simplicity  and  minimum  requirement  for  data 
storage.  In  practice,  however,  we  have  found  that  its  con¬ 
vergence  depends  critically  on  the  condition  that  the  fre¬ 
quency  is  kept  below  a  sharply  defined  threshold  above 
which  the  SOR  iteration  would  simply  stop  converging 
and  would  diverge  even  for  a  small  increase  in  the  fre¬ 
quency.  Additional  difficulties  arise  in  SOR  from  its  need 
to  supply  a  proper  relaxation  coefficient,  which  is  difficult 
to  obtain.  In  contrast,  there  are  several  attractive  advan¬ 
tages  with  the  so-called  QMR  (Quasi-Minimal-Residual) 
iterative  algorithm  [4],  Until  quite  recently,  all  existing 
iterative  algorithms  either  use  non-minimization  proce¬ 
dure  in  generating  the  update  vectors,  which  results  in  a 
very  volatile  residual  reduction  process,  or  use  a  full  mini¬ 
mization  procedure  which  requires  increasing  data  storage 
as  the  iteration  proceeds.  By  applying  a  quasi-minimal 
residual  approach,  QMR  is  able  to  retain  the  advantage 
of  requiring  only  fixed  amount  of  data  storage.  At  the 
same  time,  the  residual  reduction  process  becomes  very 
smooth  with  only  occasional  increases  of  insignificance. 
Furthermore,  like  all  other  Krylov  subspace  method  based 
on  Lanzos  vectors,  no  external  parameters  are  needed 

[4] .  The  particular  QMR  algorithm  we  have  implemented 
is  the  TFQMR  (Transpose-Free  Quasi-Minimal-Residual) 

[5]  with  the  left  preconditioner  M  =  Mt  for  A  -  At 
and  M  =  M„  for  A  =  A0.  The  procedure  of  TFQMR  is 
summarized  in  the  following: 

(1)  Initialization: 

(a)  b  =  M~lB] 

(b)  y\  =  u0  =  r0  =  f  -  b,  x0  =  d0  =  0; 

n 


Figure  1:  Comparison  of  SOR  and  QMR  with  matr 
transformation. 

(c)  po  =  t0  =  ||r0||,  60  =  vo  =  0 
(2)  For  n  =  1,2, ...  do 

(a)  t>„_i  =  AfMjfcn-i; 

(b)  <7n-l  ~  r  •  Vn_l,  On  —  1  ~  Pn-l/o‘n~\\ 

(c)  Wn  =  Un— i  —  On  — l^n-li 

(d)  rn  =  rn_i  -  on—iM~ * A(un— i  +  yj„_,); 

(e)  u/2„+i  =  ||r„||,  uf2n  =  v/Urnll  ■  ||r„_jj|; 

(f )  For  m  =  2n  —  1, 2n  do 

•  dm  =  ym  4-  ($m-l  9m- 1  / On-i)dm_i; 

•  &m  —  Wm+l/Tm— 1,  Cm  —  \  j  \J\  ■+• 

•  —  Wm+lCm,  9m  —  C^jOn  — l! 

•  xm  ~  xm- 1  4"  qmdmi 

•  If  x  has  converged:  stop 

ifi)  Pn  ~  ^  'rni  0n  =  Pn/Pn- li 

(h)  un  =  r„  4-  0„y2n; 

(i)  Sttn+l  =  ti„  +  /?n(»/2n  +  Pntfln-l) 

In  [5],  \/l  +  rnTm  is  given  as  an  upper  limit  for  the  resid¬ 
ual  norm.  In  our  implementation,  the  residual  norm  is 
initially  assumed  to  be  equal  to  rm  (a  coefficient  of  unity) 
When  Tm  drops  below  the  tolerance,  the  true  residual 
norm  is  evaluated  which  also  gives  out  the  real  propor¬ 
tional  coefficient  between  the  residual  norm  and  rm.  If 
the  residual  norm  is  not  below  the  tolerance,  the  iteration 
continues  with  the  new  coefficient.  The  comparisons  ol 
the  numerical  performance  of  QMR  and  SOR  with  and 
without  the  matrix  transformation  are  shown  in  Figures 
1  and  2.  As  Figure  1  indicates,  the  difficulty  in  select¬ 
ing  a  proper  relaxation  coefficient  causes  SOR  method  to 
under-perform  QMR  above  1GHz.  Also,  the  QMR  curve 
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Figure  2:  Comparison  of  SOE  and  QMR  without  matrix 
transformation. 

extends  up  to  1015  Hz  to  verify  that  its  convergence  indeed 
improves  at  extremely  high  frequencies.  The  maximum  of 
iteration  number  in  the  QMR  curve  seems  to  be  closely 
related  to  the  ratio  of  maximum  and  minimum  values  of 
non-zero  entries  in  DDj1.  The  ratio  for  this  example  is 
about  130.  At  low  frequencies,  a  major  overhead  for  us¬ 
ing  the  matrix  transformation  is  the  need  to  carry  out 
a  Jacobi  matrix  decomposition  at  every  frequency  which 
dominates  the  CPU  time.  Since  the  matrix  transforma¬ 
tion  achieves  little  at  these  frequencies,  it  is  desirable  to 
use  it  only  when  needed.  In  this  regard,  the  magnitude  of 
vDDj1  seems  to  be  a  good  measure  since  it  determines 
how  far  the  transformation  matrix  deviates  from  the  unity 
matrix.  In  practice,  the  maximum  value  of  the  non-zero 
entries  of  uDDj1  is  used  as  the  figure  of  merit.  As  can  be 
seen  from  Figure  2,  without  matrix  transformation  QMR 
is  quite  competitive  with  SOR  when  both  methods  con¬ 
verge.  However,  SOR  fails  to  converge  well  below  1  GHz 
while  QMR  continues  to  converge  at  much  higher  frequen¬ 
cies.  In  fact,  until  about  5  GHz,  QMR  without  transfor¬ 
mation  outperforms  QMR  with  transformati  At  still 
higher  frequencies,  QMR  without  transformation  never 
stops  to  converge  but  the  rate  of  convergence  falls  off  very 
quickly.  From  these  results,  a  demarcation  vaiue  for  the 
figure  of  merit  of  about  0.3  has  been  selected  above  which 
the  matrix  transformation  is  carried  out. 

Application  to  GaAs  MESFET  Analysis 

GaAs  MESFETs  with  gate  length  of  1/im  and  width  of 
2  mm  have  been  fabricated  and  measured  for  h—  and 
5-parameters  for  various  frequencies  at  different  bias. 
Figure  3  shows  the  current  gain  comparison  at  Vos  —  5  V 
and  Vas  —  —0.5V.  For  all  the  data  available,  simula¬ 
tion  results  closely  tfack  -those  of  the  measurement,  and 


Figure  3:  Simulated  and  measured  current  gain  h]t  at 
Vos  =  5V  and  VGS  =  -0.5V. 

beyond  the  measurable  frequency  range  simulation  pre¬ 
dicts  the  device  characteristics  well  in  line  with  the  ex¬ 
pected  (extrapolated)  behavior  providing  an  estimated 
/t  of  about  14  GHz.  This  suggests  that  the  key  de¬ 
vice  behavior  such  as  hje  can  be  simulated  quite  well 
with  high  frequency  simulations  using  PISCES.  Figure  4 
shows  the  Smith  chart  of  5— parameters  at  Vos  =  2V 
and  Vos  =  0V.  In  this  case,  however,  there  are  some  dis¬ 
crepancies  between  the  measurement  and  simulation  re¬ 
sults.  The  fact  that  5n  goes  into  the  upper  half  circle  at 
high  frequencies  is  a  clear  indication  of  inductive  behav¬ 
ior  which  can  only  be  attributed  to  the  wire  inductance 
and  is  not  included  in  the  PISCES  simulation.  To  verify 
this  point,  MESFET  is  modelled  with  additional  para- 
sitics  around  the  intrinsic  device  and  the  complete  circuit 
mode]  is  shown  in  Figure  5.  By  using  the  y— parameters 
from  PISCES  simulation  for  the  intrinsic  device,  induc¬ 
tance  of  0.075  nH  have  been  added  to  the  gate  and  drain 
electrodes  (Lt  and  Li  in  Figure  5).  The  simulation  results 
thus  obtained  are  shown  in  Figure  4  with  squares.  Indeed 
a  much  better  agreement  is  achieved. 

Conclusions 

A  robust  iterative  algorithm  for  high  frequency,  small- 
signal  oc  analysis  is  presented,  which  combines  a  novel 
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Figure  4:  Comparison  of  measured  and  simulated  Su  and 
S22  at  Vqs  =  2V  and  Vqs  =  OV.  The  dark  symbols  are  for 
Su  and  open  ones  for  S22-  Circles  are  those  from  measure¬ 
ment,  triangles  from  PISCES  simulation  for  the  intrinsic 
device,  and  squares  from  the  mixed  device/circuit  simu¬ 
lation  including  lead  parasitics. 


matrix  transformation  scheme  and  a  memory-** 
iterative  method  -  TFQMR  (TVanspose-Fr^  q'***11 
Minimal  Residual).  Excellent  convergence  beh 
demonstrated  for  frequencies  up  to  1015  HZ  with**0*  ** 
quiring  increase  in  array  size  used  in  dc  analysis  n  l*  te' 
quency  analysis  of  high-speed  GaAs  MESFETs  has  h 
performed  using  the  new  algorithm  and  the  essential 
havior  of  the  device  is  well  emulated.  By  adding  .  . 
elements  to  the  intrinsic  device,  better  agreement  'tlC 
tained  in  the  5— parameters  of  the  device.  **  °^ 
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ABSTRACT 

Progress  in  technology  CAD  (TCAD)  tools  that  span  the  range  of  IC  circuit  design,  semicon¬ 
ductor  device  modeling,  and  material  processing  simulation  is  discussed.  Special  emphasis  is 
given  to  the  application  in  the  opto-electronic  devices  and  systems.  Promising  techniques  for 
the  mixed  device / circuit  simulation  are  presented  using  UC  Berkeley’s  SPICE  and  Stanford’s 
PISCES  programs.  Results  show  the  capability  to  optimize  the  circuit  timing  and  impedances 
to  match  the  internal  physical  effects  and  the  resulting  optical  properties  in  AlGaAs  based 
LEDs.  3D  solid  modeling,  a  new  approach  for  interactive  device  design  based  on  the  mask  lay¬ 
out  information,  is  discussed.  Automatic  gridding  to  support  a  unified  process/device  modeling 
capability  for  GaAs  structures  using  SUPREM-IV.GS  is  also  introduced.  Taken  as  a  whole, 
the  unification  of  these  TCAD  tools  provides  a  powerful  design  approach  for  the  world  of  the 
opto-electronic  ICs  (OEIC). 

1  Introduction 

Application  of  semiconductor  lasers  and  LEDs  in  all  aspects  of  electronic  systems  is  growing  rapidly.  Es¬ 
pecially  in  the  context  of  data  communications,  the  need  for  coupling  electronic  circuit  and  optical  device 
design  is  critical.  Based  on  the  experience  of  modeling  silicon  IC  processes  and  devices  over  the  past  two 
decade,  a  new  TCAD  tool  kit  targeted  at  the  opto-electronic  applications  is  being  developed  at  Stanford 
in  close  collaboration  with  industrial  partners  such  as  HP.  In  this  paper,  we  will  address  the  major  issues 
related  to  the  modeling  of  opto-electronic  devices  and  the  simulation  of  interaction  between  electronic  cir¬ 
cuits  and  opto-electronic  devices.  These  topics  are:  a  unified  model  for  carrier  transport  in  heterostructure 
with  varying  lattice/carrier  temperature,  implementation  of  2D  and  quasi-3D  device  simulation  for  OE- 
ICs,  framework  issues  related  to  gridding  and  solid  modeling,  and  mixed-mode  device/circuit  simulation. 
Application  examples  will  be  provided. 

2  DUET  Model  for  Carrier  Transport  in  Heterostructures  Including 
Temperature  Effects 

This  section  considers  the  electrical  modeling  of  the  opto-electronic  devices,  a  key  factor  in  overall  efficiency 
of  OEICs.  To  describe  the  carrier  transport  in  a  heterostructure  the  additional  driving  forces  caused  by 


changes  of  band  parameters  such  as  the  effective  mass,  electron  affinity,  and  bandgap,  have  to  be  taken 
into  consideration.  Moreover,  the  effects  of  carrier  degeneracy  (i.e.,  Fermi-Dirac  statistics  as  opposed  to 
the  Boltzmann  or  Maxwellian  distribution)  and  the  lattice/carrier  temperature  play  an  even  greater  role 
in  compound  semiconductors  than  for  the  elemental  semiconductors  such  as  silicon  because  of  the  poor 
thermal  conductivity  and  smaller  effective  density  of  states.  Following  Stratton's  original  approach  [1],  we 
have  developed  a  general  transport  model  based  on  the  moment  approach  to  solving  Boltzmann  transport 
equation  (BTE)  to  describe  the  carrier  transport  in  heterostructures  with  temperature  effects  included. 
Because  this  transport  model  covers  both  the  carrier  and  lattice  temperatures,  we  name  the  model  as 
DUET  for  dual  energy  transport. 

For  complete  modeling,  six  variables  are  needed  to  determine  uniquely  the  device  state  and  they 
are:  electrostatic  potential  (V*),  electron  and  hole  concentrations  (n  and  p),  electron  and  hole  carrier 
temperatures  (Tn  and  Tp),  and  the  lattice  temperature  ( Ti ).  Carrier  temperature  is  defined  through  the 
temperature  parameter  in  the  statistics  for  carriers  at  quasi-thermal  equilibrium,  which  is  the  Fermi-Dirac 
distribution  in  our  model,  and  the  lattice  temperature  is  a  measure  of  the  kinetic  energy  of  phonons.  The 
distribution  function  of  carriers  (/),  in  the  real  (r)  and  momentum  (p)  space  at  any  given  non-equilibrium 
state  is  constructed  using  the  quasi-thermal  equilibrium  distribution  function  (/o)  as  follows. 

/(r»  P)  =  /o(r,p)  -  r(r,p)  (^V/0  -  (1) 

where  a  constant  effective  mass  (m*)  is  assumed,  and  E'  is  the  kinetic  energy  and  for  electrons 

E'M  =  5^  =  E  -  Ec(r)  (2) 

with  E  the  total  energy  and  Ec  the  energy  at  the  conduction  band  edge.  /0  is  taken  from  the  Fermi-Dirac 
statistics  as 


where  both  the  quasi-Fermi  energy  (Q  and  electron  temperature  are  position-dependent,  and  fc#  is  the 
Boltzmann  constant.  F  is  the  field  acting  on  the  charged  carrier  and  for  electrons 

F  =  iv£0(r)  (4) 

The  formulation  for  holes  are  completely  analogous,  r  in  Eq.  (1)  is  the  microscopic  relaxation  time  which 
depends  in  principle  on  r  and  p,  but  is  usually  assumed  to  only  depend  on  the  carrier  energy,  E,  and  some 
other  position-dependent  scattering  mechanisms  such  as  impurity  scattering. 

Based  on  the  knowledge  of  the  distribution  function,  physical  quantities  such  as  the  carrier  concentration 
(n),  the  kinetic  energy  density  (te„),  the  electric  current  density  (jn),  and  the  kinetic  energy  flux  density 
(s„)  can  all  be  found  from  their  definitions.  Thus, 


n  =  J  f<?P  =  J  fod3p  =  NcF1/2(tj) 

(5) 

(6) 

j-  =  -^Jfpd3p 

3 

=  qDVn  -  -qnDV  In  m*  +  npVEc  +  qnDT  VT„  (7) 

=  n/<VC  +  qnfiQVTn  (8) 

*n  =  /  fpE'dp3  =  -Pnr„jn  -  KnVTn  (9) 

m  y 

where 

V  =  IC(*)  ~  Ec(x)]/kBTn  (10) 

F„  is  the  Fermi  integral  of  order  v,  and  all  other  symbols  and  coefficients  have  conventional  meaning  used 
in  solid  state  physics  and  can  be  derived  knowing  the  relaxation  time  and  distribution  function. 

The  equations  to  be  solved  are  the  Shockley  semiconductor  equations,  which  include  the  Poisson’s 
equation  and  carrier  continuity  equations,  and  the  energy  balance  equations  based  on  the  Fick’s  second 
law  for  the  kinetic  energy  of  carriers  and  the  lattice.  The  complete  set  of  equations  are  listed  below: 
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where  Tl  is  the  lattice  temperature,  and  and  are  energy  relaxation  times  for  electrons  and  holes, 
respectively,  and  they  characterize  the  carrier  kinetic  energy  loss  due  to  the  phonon  scattering. 

The  direct  energy  exchange  between  the  two  carrier  subsystems  is  through  Auger  recombination  and 
impact  ionization,  while  the  exchange  among  carrier  subsystems  with  lattice  is  through  Shockley-Read-Hall 
(SRH)  recombination  and  phonon  scattering.  The  carriers  also  release  energy  to  emit  photons  through 
radiative  recombination.  The  generation  of  electron-hole  pairs  due  to  the  absorption  of  photons  is  not 
taken  into  account  in  this  formulation.  For  the  simulation  of  laser  diodes,  often  the  carrier  temperature  is 
not  a  major  concern  but  the  lattice  temperature  is.  One  can  then  set  T„  =  Tp  =  Tl  and  Eqs.  (14)-(16) 
collapse  into  the  following  form: 


[ci«  +  |(n  +  p)*Bj  T  =  V  •  (klVT  -  sn 


“  sp )  +  jn  •  F„  +  jp  •  Fp  +  usRjjEg(T) 


(17) 


where  jn  *Fn  -f  jp  *FP  can  be  identified  as  the  Joule  heat  which  represents  the  conversion  of  the  electrostatic 
potential  energy  to  the  carrier  kinetic  energy,  and  the  term  usrhEs(T)  represents  the  conversion  of  the 
carrier  potential  energy  to  the  phonon  (kinetic)  energy  through  SRH  recombination. 

The  above  equations  and  expressions  are  good  for  any  dimensionality,  either  ID,  2D,  or  3D.  In  order  to 
solve  the  system  of  partial  differential  equations  as  described  in  this  section,  proper  boundary  conditions 
are  critical  -  especially  for  the  thermal  simulation.  We  will  not  detail  the  discussion  here,  but  the  interested 
readers  are  referred  elsewhere  [2]. 

3  Model  Implementation  in  2D  and  Quasi-3D  (Cylindrical 
Symmetry) 

The  DUET  model  has  been  implemented  in  Stanford’s  PISCES-2ET.  The  results  for  the  electrical  sim¬ 
ulation  of  heterostructure  devices  such  as  HBTs,  LEDs,  and  SELs  (surface  emitting  lasers)  have  been 
compared  to  the  measurement  data.  Currently  four  material  systems  are  available:  ternary  compounds 
AlfGai-fAs  and  AlrIni_xAs,  quaternary  compound  GazIni_xAsvPi-],,  as  well  as  binary  compound  sys¬ 
tem  for  GecSii_z.  The  base  materials  corresponding  to  these  compounds  are  GaAs,  InAs,  InP,  and  Si.  The 
material  parameters  have  been  calibrated  either  through  values  available  in  the  literature  or  by  the  mea¬ 
sured  data  obtained  from  HP.  More  than  one  material  system  can  be  incorporated  in  a  single  device  region. 
Non-stationary  transport  mechanisms,  i.e.,  those  where  the  concept  of  Fermi  level  doesn’t  apply,  such  as 
the  thermal  emission  and  tunneling  through  an  abrupt  barrier  have  not  been  implemented  currently. 

Although  PISCES- 2ET  is  a  2D  device  simulator  in  general,  in  the  special  case  of  cylindrical  symmetry, 
3D  simulation  can  also  be  performed.  This  applies  to  the  simulation  of  light  emitting  diodes  (LEDs)  which 
have  cylindrically  symmetrical  structure.  The  algorithm  for  the  cylindrical  coordinate  simulation  is  to 
sweep  the  element  in  the  2D  cross  section  around  the  central  axis  and  because  the  flux  calculated  in  2D 
plane  is  always  perpendicular  to  the  surface  of  the  3D  elementary  volume,  the  assembly  of  the  control 
volume  equation  is  essentially  the  same  as  in  the  2D  case.  Since  there  is  no  approximation  involved,  it  is 
a  rigorous  3D  simulation  in  this  case. 

In  the  following  we  show  the  simulation  of  a  LED  with  cylindrical  symmetry.  The  half  structure  is 
shown  in  Figure  1,  and  the  center  of  the  diode  is  on  the  right  side.  The  anode  contact  is  on  the  bottom 
and  the  cathode  contact  is  on  the  top  of  the  structure  with  a  ring  shape  to  allow  light  to  be  emitted  from 
the  central  opening  of  the  diode.  The  unique  feature  and  the  challenge  to  the  numerical  simulation  is  the 
existence  of  a  floating  n+  layer  in  the  p- type  region.  The  function  of  this  floating  layer  is  to  confine  the 
current  flow  in  the  central  region  so  that  the  emitting  efficiency  will  be  enhanced,  and  the  determination 
of  its  shape,  location,  and  doping  level  is  one  of  the  key  issues  in  LED  design.  From  the  numerical  point 
of  view,  the  heavily  doped  floating  layer  essentially  blocks  the  flow  of  the  majority  carriers  through  it  and 
hence  can  cause  severe  numerical  instability.  There  are  several  ways  to  get  around  of  this  problem;  some 
involve  elaborate  algorithms.  In  our  current  approach,  we  simply  attach  an  external  contact  to  the  layer 
and  connect  a  current  source  to  the  ground  node.  If  the  current  source  is  set  to  zero  value,  it  is  essentially 
an  open  circuit  condition.  This  scheme  works  very  well  when  the  diode  current  has  a  substantial  value,  i.e. 
for  the  on  state.  When  the  diode  is  turned  off,  the  convergence  of  numerical  iterations  becomes  more 
difficult  but  the  exact  results  have  less  concern  to  the  designer.  The  simulated  I  —  V  characteristics  are 
shown  in  Figure  2.  In  the  later  section,  we  will  show  how  this  LED  is  used  in  a  transmitter  circuit  for 
data  communication  and  the  mixed-mode  device/circuit  simulator  will  be  invoked  to  find  the  frequency 
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Figure  1:  Device  structure  of  cylindrically  symmetric  LED.  The  symmetric  axis 
simulation  region. 
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Figure  2:  Simulated  I —V  characteristics  of  LED.  Notice  that  the  bias  is  applied  to  the  top  of  the  structure 
(n-layer),  so  it  is  negative. 
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Figure  3:  The  framework  of  tools  used  in  IC/OEIC  device/ circuit  design. 

response. 

4  OEIC  Design  Framework  -  3D  Solid  Modeling  and  Gridding 

The  design  of  OEICs  requires  not  only  the  consideration  of  devices  with  complex  material  structures  as 
shown  in  Figure  1  but  also  the  integration  of  circuit  functionality  that  meets  the  electrical  constraints 
needed  for  proper  system  operation.  Moreover,  there  is  growing  emphasis  on  tightly  integrated  systems 
and  even  monolithic  solutions.  For  these  purposes  the  use  of  device/IC  mask  layout  information  as  well  as 
3D  solid  geometry  models  are  of  great  interest  as  utilities  for  both  visualizing  and  manipulating  the  designs. 
Figure  3  shows  key  portions  of  a  design  framework  for  OEIC  applications  that  utilizes  both  commercially 
available  tools  and  custom  utilities  for  supporting  functional  analysis  of  an  OEIC  subsystem.  The  layout, 
solid  modeling,  and  circuit  extraction  information  come  from  commercial  tools  (i.e.,  Cadence  and  ACIS). 
Discussion  of  process  modeling  is  beyond  the  scope  of  this  paper  and  relevant  materials  can  be  found  in  [3]. 
The  following  discussion  focuses  on  the  issue  of  solid  geometry  modeling  and  gridding.  The  next  section 
considers  the  issues  of  mixed-mode  simulation. 

As  indicated  in  Figure  3,  3D  structures  created  by  solid  modeling  can  be  of  great  value  in  designing 
OEICs.  The  concept  of  the  solid  modeling  includes  the  visualization  of  the  actual  structure  with  constituent 
layers  in  a  3D  perspective,  meshing  on  the  surface,  dissection  of  the  structure  from  different  view  angles, 
and  the  simulation  of  the  structure  either  in  a  2D  cross  section  or  with  a  truly  3D  simulation.  Solid 
modeling  has  been  very  useful  in  diagnosis  of  high  density  structure  such  as  memory  cell  design  and  in  the 
analysis  of  parasitic  components  that  are  frequently  encountered  in  high  speed  microwave  circuits. 

Figure  4  shows  a  3D  rendering  of  a  six  transistor  (6T)  SRAM  cell  where  the  interconnection  layers  are 
easily  recognized.  This  figure  clearly  indicates  the  complexity  of  the  structure  that  can  easily  be  created 
and  analyzed  using  the  utilities  and  information  models  presented  in  Figure  3. 


Figure  4:  3D  rendering  of  six-transistor  SRAM  cell  using  solid  modeling.  The  cut  plane  is  for  the  2D  cross 
section  later  used  for  device  simulation. 

One  of  the  major  obstacles  towards  the  3D  device/process  simulation  is  the  discretization  of  the  object 
in  space,  i.e.  gridding  or  meshing.  Especially  for  the  opto-electronic  devices,  there  are  multi-layers  in  the 
structure  with  different  material  properties  and  doping  levels.  The  interface  between  different  layers  often 
changes  abruptly.  We  have  developed  various  automatic  mesh  generators  to  facilitate  the  simulation  of  IC 
devices  and  processes.  The  most  recent  development  is  a  program  called  FOREST  [4],  which  is  targeted 
at  the  auto-meshing  of  complex  multi-layer  device  and  process  simulation.  In  the  following  we  describe 
briefly  the  algorithm  and  feature  of  the  program.  An  illustrative  example  will  be  given  to  show  the  steps 
in  meshing  process  and  the  capability  of  the  program. 

FOREST  is  based  on  the  quadtree  decimation  algorithm,  which  guarantees  high  quality  of  the  leaf  ele¬ 
ments  -  triangles  in  the  2D  simulation.  FOREST  considers  only  2D  structures,  and  all  triangles  generated 
from  it  are  either  acute  or  right  with  aspect  ratio  as  close  to  one  as  possible.  Fineness  of  the  space  partition 
is  controlled  and  can  be  dynamically  adapted/refined  according  to  a  number  of  criteria  such  as  the  doping 
profile  and  solution  error  tolerance.  Besides  its  application  in  device  simulation,  for  which  the  boundary 
of  the  structure  is  usually  fixed,  the  program  is  also  suitable  for  the  multi-layer  process  simulation  where 
the  topography  undergoes  rapid  changes  (e.g.  SUPREM-IV.GS).  A  graphics  user  interface  is  also  provided 
with  the  program. 

Figure  5  shows  how  FOREST  grids  in  a  step-by-step  manner  a  complex  structure  with  non-planar 
surfaces  and  even  interior  (or  back-side)  regions.  This  particular  example  comes  from  silicon  process 
simulation  (i.e.,  SUPREM-IV.GS)  and  gives  emphasis  to  the  need  for  not  only  boundary  definitions  but 
also  wide  dynamic  range  in  feature  sizes.  The  issue  of  gridding  backside  or  interior  regions  (i.e.  air  bridges) 
can  be  especially  important  for  multi-chip  and  other  packaging  concerns  such  as  OEIC  connectors. 


Figure  5:  Process  of  FOREST  mesh  generation  for  a  complex  device  structure  with  hole  in  it. 

5  Mixed  Device/Circuit  Simulation  of  Optical  Data  Communication 

The  performance  of  an  electronic / opto-electronic  system  depends  not  only  on  that  of  the  individual  devices, 
but  also  on  the  optimized  design  of  peripheral  driving  circuitry  and  minimization  of  effects  from  parasitics. 
Thus  to  evaluate  a  system  design  circuit  level  simulation  is  required.  However,  the  circuit  simulation  uses 
analytical  (compact)  models  for  all  non-linear  devices,  which  axe  not  always  available  when  a  new  device 
is  emerging.  Especially  for  opto-electronic  devices  where  the  optical  performance  is  based  on  the  electrical 
response,  and  compact  models  are  often  either  not  yet  developed  or  not  easily  characterized.  We  have 
developed  a  modularized  mixed  device/circuit  simulator  where  the  device  simulator  provides  a  means  to 
model  complex  device  structures  numerically  in  concert  with  the  circuit  simulation  [5].  The  shell  of  the 
mixed  simulator  is  SPICE  3  from  UC  Berkeley,  and  device  simulator  is  “plugged  in”  as  a  loosely  coupled 
module  -  currently  PISCES-2ET  is  used  in  this  work,  but  other  device  simulators  can  also  be  used  as  long 
as  they  provide  not  only  the  DC  I  —  V  characteristics  but  also  the  slope  (more  broadly,  y— parameters) 
information. 

Although,  several  examples  have  been  demonstrated  for  mixed  device/circuit  simulation  from  various 
sources,  either  academically  or  industrially,  they  all  focus  on  electronic  components.  We  present  here  a 
simulation  of  drive  circuitry  in  an  optical  transmitter  using  an  LED  to  link  to  optical  fiber.  The  key 
requirement  of  this  configuration  is  the  fast  optical  response  of  the  LED,  which  can  be  characterized  by 
the  conductive  (diffusive)  current  component  in  the  ac/transient  response.  The  detailed  circuit  is  shown 
in  Figure  6  and  the  LED  has  the  same  structure  as  the  one  used  in  the  device  simulation.  The  exact 
structure  is  used  in  the  mixed  simulation.  The  LED  is  driven  by  the  ECL  output  either  as  sinusoidal  or 
square  wave.  The  waveform  for  the  current  through  the  LED  from  the  time  transient  analysis  is  shown 
in  Figure  7  and  comparison  is  made  to  the  analytical  diode  model.  In  this  case  the  compact  model  has 
been  experimentally  optimized  so  the  fit  with  the  numerical  model  is  quite  good.  For  new  generations  of 
technology  where  experimental  data  is  not  available,  the  use  of  this  mixed-mode  analysis  is  more  critical. 


Figure  6:  Driving  circuit  for  LED  linked  to  optical  fiber  for  high-speed  data  communication 
(courtesy  of  HP). 
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Figure  7:  Comparison  of  simulated  waveforms  from  analytical  and  numerical  models  in  circuit  simulation. 


6  Conclusions 


This  paper  has  considered  a  range  of  issues  involved  in  the  simulation  of  OEIC  devices  and  circuits.  Starting 
from  the  issues  of  electrical  behavior  of  opto-electronic  devices,  the  details  of  a  new  dual  energy  transport 
(DUET)  model  have  been  presented.  Key  features  of  thi6  new  capability  include  the  ability  to  couple  lattice 
temperature  with  the  carrier  temperature  system  of  equations.  Especially  for  compound  material  systems 
with  poor  thermal  conductivity,  this  coupled  analysis  is  of  growing  importance.  Another  unique  analysis 
capability  that  is  demonstrated  in  this  paper  is  that  of  mixed-mode  simulation  where  optoelectronic  effects 
can  be  simulated  at  the  device  level  in  concert  with  conventional  circuit  simulation  of  the  purely  electrical 
effects.  In  this  way  the  details  of  the  optical  and  electronic  couplings  can  be  analyzed  and  correlated  directly 
with  the  materials  and  geometry  effects  rather  than  by  means  of  compact  equivalent  circuit  models  that 
are  difficult  to  characterize.  Finally,  some  of  the  issues  related  to  OEIC  design  frameworks  have  been 
presented.  Automated  gridding  of  devices  and  3D  geometry  modeling  that  facilitates  visualization  and 
extraction  of  parasitic  interactions  have  been  presented.  The  information  models  for  such  applications 
and  the  availability  of  commercial  tools  (ie  Cadence,  ACIS,  AVS  etc.)  provide  a  powerful  environment 
from  which  to  develop  a  more  specific  set  of  applications  and  other  utilities  to  support  the  development  of 
OEICs.  This  paper  and  the  associated  examples  provide  motivation  to  go  further  in  the  development  of 
OEIC  design  system  based  on  the  growing  power  of  such  information  models  and  tools. 
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Abstract 


Progress  in  algorithms  and  simulators  that  exploit  parallel  computers  are  reported.  Results  using  two 
generations  of  the  Intel  iPSC  architecture  for  device  analysis  of  2D  and  3D  bipolar  problems  are  used  to 
illustrate  substantial  progress  being  made  in  parallelization.  New  approaches  in  the  areas  of  hydrodynamic  and 
Monte  Carlo  analysis  are  also  discussed. 

Introduction 

Over  the  past  decade  the  computational  demands  of  TCAD  have  grown  as  have  the  power  and  availability  of 
new  parallel  computers.  Specifically,  requirements  for  large  3D  process  and  device  simulations  with  grid 
numbers  reaching  several  million  have  become  essential —especially  to  analyze  parasitic  effects  where  multiple 
devices  are  involved.  On  the  hardware  side,  tightly  coupled  shared  memory  machines  (Cray  and  Alliant  for 
example)  have  given  way  to  distributed  memory  machine  (Intel  and  Thinking  Machines)  and  now  experimental 
architectures  that  attempt  to  provide  distributed  and  shared  memory-far  example  the  DASH  Project  at  Stanford 
are  being  pursued.  The  focus  of  this  talk  is  primarily  aimed  at  summarizing  experiments  at  Stanford  in  2D  and 
3D  device  simulation  matte  over  the  past  several  years  using  parallel  computers.  This  involves  a  variety  of 
algorithmic  approaches  including:  drift-diffusion,  Monte  Carlo  and  most  recently  hydrodynamic  formulations. 
The  primary  class  of  computers  used  in  this  study  was  the  Intel  iPSC  series  (including  the  commercial  i860  and 
the  experimental  "Delta”  machine  at  Caltech).  Results  show  very  promising  performance  and  potential  for 
achieving  TeraFLOP  performance  using  1000s  of  processors  within  a  few  years. 

Physical  Models  and  Algorithms 

There  is  a  range  of  physical  models  that  can  be  applied  to  semiconductor  device  modeling  problem.  Basically 
there  are  three  major  classes  of  ways  to  solve  the  Boltzmann  Transport  Equations  using: 

1 .  The  first  moment  (drift-diffusion  or  D/D) 

2.  Higher  order  moments  (hydrodynamic(HD) ,  energy  transport  (ET)...) 

3.  Monte  Carlo  (MC)  as  a  means  to  directly  build  the  statistics. 

In  this  talk  we  will  give  examples  of  how  each  of  these  approaches  can  exploit  parallel  computation. 

Drift-diffusion  (D/D) — In  the  case  of  the  classical  D/D  formulation,  we  have  previously  reported  results  on 
parallelization  of  the  3D  STRIDE  code.  Here  we  repot  parallelization  of  the  well-known  2D  PISCES  code. 
Figure  la  illustrates  all  steps  needed  to  parallelize  the  PISCES  code  and  the  associated  percentages  of  time 
required  in  the  simulation.  Clearly,  the  matrix  solution  part  is  a  major  but  not  dominant  fraction.  Also,  one  can 
note  that  parallelization  of  assembly  of  matrix  elements  is  every  bit  as  important  in  the  overall  effort.  Figure  lb 
shows  a  profile  of  actual  run  times  (wall  clock  time)  versus  the  number  of  processors  on  the  Intel  iPSC/i860  for 
two  problems.  For  comparison  the  times  on  a  single  CPU  SUN  4/670  are  also  listed.  Note  that  for  the  27.600 
equation  problem,  the  speed-up  factors  are  substantial.  Also  note  that  beyond  16  nodes  for  this  particular 
problem  the  speed-up  factor  degrades  due  to  communications  limitations.  Further  details  will  be  discussed  in 
thepresentatioa 

In  previous  papers  we  have  discussed  progress  in  3D  D/D  modeling  using  the  prototype  STRIDE  code  [1][2], 
Ova  the  past  year  we  have  scaled  the  problem  size  and  numba  of  processors  used  in  the  analysis  to  much  larger 
numbers  than  previously  reported.  Figure  2  shows  the  wall-clock  time  and  grid  density  (two  y-axes)  versus  the 
numba  of  processors  used  on  the  Delta  machine  at  Caltech.  This  is  an  experimental  version  of  the  Intel  iPSC 
architecture  with  a  backplane  interconnect  designed  at  Caltech.  As  can  be  observed  from  the  data,  we  see 
excellent  performance  and  efficiency  even  up  to  more  than  500  processors  and  grid  density  approaching  5 
Million  for  a  bipolar  example.  The  sustained  performance  was  1.7  GFlops.  In  this  talk  we  will  discuss  some  of 
the  algorithms  used  with  regards  to  preconditioning  that  are  essential  in  achieving  such  favorable  results. 


Hydrodynamic  (HD)  and  beyond — The  limitations  of  the  D/D  formulation  are  well-known  and  more* 
advanced  models  such  as  die  HD  or  ET  formulations  are  being  used  extensively.  From  a  physical  modeling 
point  of  view  we  continue  to  extend  the  ET  approach  [3].  From  a  numerical  point  of  view  we  are  investigating 
the  HD  formulation  in  the  context  of  major  advances  in  the  CFD  domain  [4].  Specifically,  we  are  using  the 
time-space  GLS  formulation  in  concert  with  the  HD  carrier  transport  equations  and  developing  a  parallelized 
version  of  the  ENSA  solver  code  (Euler  Navier-Stokes  Analyzer).  Since  a  paper  on  this  subject  has  been 
submitted  to  this  conference,  we  will  only  briefly  highlight  the  results. 

Monte  Carlo  (MC) —  is  a  still  more  complete  physical  approach  to  solving  the  BTE.  While  there  are  many 
limitations  in  applying  MC  analysis  due  to  boundary  conditions,  especially  across  heterojunction  and  dielectric 
interfaces,  there  are  many  aspects  of  hot  carrier  phenomena  that  can  be  effectively  solved  using  MC  analysis. 
At  S1SDEP  we  have  presented  results  of  efforts  in  parallelization  of  MC  using  the  University  of  Bologna’s 
BEBOP  code  [5],  Further  efforts  in  this  direction  are  being  reported  by  the  Matsushita  group.  The  basic  factors 
affecting  parallelization  of  MC  analysis  are  straightforward  from  an  algorithmic  point  of  view.  On  the  other 
hand,  there  are  still  many  innovations  that  are  emerging  to  use  either  harmonic  expansions  or  alternate 
formulations  for  the  variables,  including  symmetry  relationships.  The  discussion  of  MC  analysis  will  center  on 
promising  new  approaches. 

Discussion 

The  above  examples  illustrate  the  progress  that  is  being  made  at  three  levels  in  developing  algorithms  that 
exploit  new  parallel  computers.  The  efforts  in  parallelization  of  the  classical  D/D  formulation  show  that 
substantial  speed-ups  and  robust  convergence  on  very  tough  bipolar  problems  can  be  achieved  well  into  the 
multi-million  grid  domain.  As  for  higher  moments  of  the  BTE,  we  see  two  important  trends.  First,  the 
development  of  new  algorithms  such  as  the  GLS  formulation  used  in  CFD  and  its  parallelization  should  yield 
major  benefits.  Second,  the  opportunities  to  advance  Monte  Carlo  methods  are  far  horn  "out  of  gas”  in  terms  of 
innovation  and  computational  enhancements.  A  final  area  of  discussion  in  this  paper  is  the  growing  importance 
of  overall  support  in  the  development  and  parallelization  of  TCAD  codes.  The  most  obvious  of  these  needs  is  in 
the  area  of  gridding  and  the  domain  decomposition.  Other  areas  include:  user  interfaces,  control  of  both 
nonlinear  and  linear  solvers  and  alternative  formulations.  These  issues  are  related  to  frameworks,  standards  and 
code  development  strategies. 

Conclusions 

Progress  in  developing  TCAD  algorithms  and  particularly  device  analysis  codes  using  parallel  computation  is 
reported.  Excellent  efficiency  and  convergence  behavior  on  2D  and  3D  bipolar  problems  is  demonstrated  using 
production  and  experimental  versions  of  the  Intel  iPSC  architecture.  Progress  in  solving  higher  moments  of  the 
BTE  are  reported  and  new  algorithmic  advances  are  discussed.  Finally,  challenges  of  overall  software 
engineering  are  considered. 
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Figure  1(a):  Structure  of  PISCES-MP  showing  how  different  portions  of  the  code  affect  execution  and  resulting 
parallelization  requirements 
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Figure  1(b):  Performance  of  PISCES-MP  showing  wall-clock  time  versus  number  of  CPUs  on  the  Intel 
iPSC/1860  for  two  bipolar  problems  (9000  and  27,600  equations).  Benchmarks  for  same  problems  on  a  SUN 
4/670  are  also  shown. 


Figure  2:  Solution  time  and  number  of  grid  in  millions  as  a  function  of  number  of  CPU  nodes  on  the  Intel  Delta 
machine  at  Caltech.  The  problem  is  a  3D  bipolar  simulation  and  the  code  used  in  the  experimental  STRIDE 
program  flj. 
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Abstract— A  3-D  device  solver  (STRIDE),  capable  of  solving 
grids  op  to  250  000  nodes,  has  been  developed  on  a  message 
passing  multiprocessor.  By  tbe  use  of  iterative  matrix  solvers 
and  Gummel  style  nonlinear  iteration  schemes,  user  memory 
per  node  is  reduced  over  use  of  direct  solvers  and  Newton 
schemes.  By  using  an  independent-edge-grouping  scheme  to  in¬ 
crease  the  vector  length  to  the  order  of  the  number  of  variables, 
the  vector  processing  efficiency  is  significantly  increased  with¬ 
out  additional  floating  point  operations.  We  extend  the  modi- 
fied-singular-perturbation  (MSP)  scheme  to  two-carrier  simu¬ 
lations.  This  significantly  speeds  up  the  convergence  rate  of 
Gummel  style  nonlinear  iterations.  Physical  insight  gained  from 
the  MSP  schemes  also  leads  to  an  automatic  switching  scheme 
between  various  nonlinear  schemes  based  on  the  monitoring  of 
certain  matrix  parameters.  This  allows  the  incorporation  of  a 
previously  proposed  Newton-lC  scheme  which  offers  the  best 
CPU  performance  for  normal  bipolar  simulations.  When  com¬ 
bined  with  current  convergence  criterion,  a  set  of  MSP  inspired 
convergence  criterion  are  better  able  to  recognize  a  practically 
converged  solution.  A  novel  global  convergence  scheme  is  also 
developed  based  on  insight  from  MSP  principles.  Interactive 
user  interface  and  links  to  graphics  tools  are  provided  to  sup¬ 
port  the  tool  integration  efforts.  Application  of  STRIDE  is  dem¬ 
onstrated  by  an  analysis  of  latchup  trigger  current  dependence 
on  layout  arrangement. 


I.  Introduction 

WITH  THE  continuing  miniaturization  of  integrated 
circuits,  3-D  effects  significantly  impact  device 
characteristics.  A  robust  and  efficient  3-D  device  solver 
will  give  device  engineers  significant  leverage  in  pursuing 
state-of-the-art  IC  technologies.  Various  3-D  simulators 
(e.g.,  [13—14])  have  appeared  to  address  these  needs. 
However,  one  of  the  major  hurdles  which  has  presented 
widespread  use  of  3-D  device  simulation  is  the  vast 
amount  of  computational  resources  required  for  such  an 
endeavor,  as  the  number  of  variables  can  easily  run  into 
hundreds  of  thousands,  or  even  millions.  Multiproces¬ 
sors,  which  connect  together  a  large  number  of  inexpen¬ 
sive  processors,  provide  a  cost-effective  platform  for 
CPU-intensive  3-D  simulations.  To  explore  the  potential 
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for  a  cost  effective  3-D  device  simulator,  we  have  devel¬ 
oped  STRIDE  (Stanford  ThRee  dimensional  DEvice  sim¬ 
ulator)  on  a  message  passing  multiprocessor  (Intel  iPSC2). 
This  paper  describes  the  progress  that  has  been  made  since 
the  previous  report  [S],  and  mainly  centers  around  the 
various  computational  aspects  with  special  emphasis  on 
bipolar  simulations.  Our  experience  in  3-D  visualization 
is  also  discussed. 

Section  II  gives  an  overview  of  the  device  solver.  Sec¬ 
tion  III  discusses  schemes  which  increase  the  vector  length 
to  the  order  of  number  of  variables  for  the  sparse  matrices 
encountered  in  3-D  simulation.  In  Section  IV,  various 
modified  singular  perturbation  (MSP)  schemes  are  intro¬ 
duced  for  two-carrier  simulations  which  significantly  im¬ 
prove  the  convergence  of  Gummel  style  nonlinear  itera¬ 
tions.  The  results  of  a  previously  proposed  Newton- 1C 
scheme  will  be  presented  which  offers  the  best  CPU  per¬ 
formance  with  less  memory  than  the  full-Newton  scheme. 
A  MSP  inspired  matrix  parameter  will  be  introduced 
which  allows  a  switching  scheme  that  automatically 
chooses  the  best  nonlinear  scheme  for  the  simulation.  In 
Section  V,  other  applications  of  MSP  principles  will  be 
discussed  which  include  a  new  set  of  convergence  crite¬ 
rion  capable  of  determining  practically  converged  solu¬ 
tions  and  a  novel  global  convergence  scheme.  Section  VI 
discusses  our  approaches  in  developing  better  user  inter¬ 
faces  and  on  tool  integration  aspects.  Section  VII,  pre¬ 
sents  an  application  example  of  STRIDE  in  the  analysis 
of  the  latchup  trigger  current’s  dependence  on  electrode 
arrangement.  Finally,  conclusions  are  drawn  in  Section 
VIII. 

II.  Overview  of  STRIDE 

In  STRIDE,  up  to  two  current  continuity  equations  are 
solved  together  with  Poisson's  equation.  In  normalized 
form,  these  equations  are  given  by 

V  •  (eV*)  =  n  -  p  +  N<  -  ND  (1) 

V  •  Jn  -  U  =  0  (2) 

V  •  Jp  +  U  =  0  (3) 

where  n  =  n,t  exp  (i  -  </>„),  p  =  nlt  exp  (<t>p  -  \t),  J„  - 
-qn„nV<bm,  and  Jp  =  qpppV<t>p.  The  normalization  con¬ 
stants  used  to  obtain  ( 1 )— (3)  are:  thermal  voltage  ( kT/q ) 
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for  electrostatic  potential  \p  and  quasi-Fermi  levels  4>„  and 
4>p,  intrinsic  carrier  concentration  n,  for  carrier  and  im¬ 
purity  concentrations,  and  the  intrinsic  Debye  length 
yftkT/q 2nj.  Effective  intrinsic  carrier  concentration  nie  is 
obtained  using  the  Slotboom  bandgap  narrowing  model 
[6].  Boltzmann  statistics  are  assumed  as  can  be  seen  in 
the  formula  for  n  and  p.  Tabulated  doping-dependent  mo¬ 
bility  values  are  used.  Tangential  held  dependent  mobility 
is  implemented  with  the  Caughey-Thomas  model  [7] . 

The  main  development  vehicle  for  STRIDE  has  been  a 
message  passing  hypercube— the  Intel  iPSC2.  The  advan¬ 
tages  of  the  hypercube  architecture  are  that  it  scales  to 
massively  parallel  systems  and  that  the  diameter  of  the 
system  (average  communication  delay  between  the  pro¬ 
cessors)  grows  only  logarithmically  with  the  number  of 
processors.  An  important  feature  of  the  Intel  hypercube  is 
the  amount  of  the  memory  per  processor.  The  system  used 
in  this  work  has  16  processors,  each  with  8  Mbytes  of 
memory.  The  sustainable  performance  for  the  system  is 
about  1 .5  MFlops  [8]  in  which  each  processor  constitutes 
an  Intel  386  paired  with  a  387  math  coprocessor.  The  de¬ 
velopment  has  recently  been  shifted  onto  an  iPSC/860 
system  which  has  32  processors  each  with  16  Mbytes  of 
memory.  Preliminary  results  have  shown  a  system  per¬ 
formance  of  approaching  100  MFlops.  STRIDE  also  runs 
on  Convex  Cl  and  Cray  YMP. 

The  simulation  domain  is  currently  approximated  by  a 
3-D  rectangular  grid  with  provisions  for  nonplanar  struc¬ 
ture  [5],  Work  is  going  on  to  develop  parallel  algorithms 
for  dealing  with  general  grids  generated  by  grid  genera¬ 
tors  such  as  OMEGA  [9],  Equations  (l)-(3)  are  discre¬ 
tized  using  the  finite  difference  method.  In  discretizing 
the  continuity  equations,  Scharfetter-Gummel  current 
formulation  [10]  is  used. 

The  discretization  of  ( 1)— (3)  yields  a  nonlinear  system 
of  algebraic  equations  which  are  solved  by  one  of  several 
nonlinear  iteration  schemes  implemented  in  STRIDE.  For 
each  nonlinear  iteration  in  Gummel’s  scheme,  the  discre¬ 
tized  Poisson's  equations  4>„,  4>p)  =  0)  are  solved 

for  the  update  vector  6\p  holding  and  <t>p  constant. 1  This 
is  achieved  by  repeatedly  solving 

=  -F+  (4) 

given  the  current  estimate  of  \p,  4>n  and  $p.  In  (4),  s 
(dFj,(\J/,  4>„,  $p)/d\j/)  is  called  the  main  matrix  of  Pois¬ 
son’s  equation.  Other  matrices  are  similarly  defined.  The 
discretized  current  continuity  equations  (F+„(iA,  4>„,  4>p)  = 
0  and  F*r(^,  4>n,  <J>p)  =  0)  are  then  solved.  Since  and 
A*r*r  are  linear,  one  matrix  solution  will  suffice.  This  pro¬ 
cess  repeats  until  convergence  is  achieved.  For  other  non¬ 
linear  schemes,  two  of  the  equations  are  solved  together 
while  the  other  is  solved  separately.  More  details  of  these 
schemes  are  discussed  in  Section  IV.  The  convergence 
criteria  are  the  maximum  magnitude  of  \j/  updates,  ter¬ 
minal  current  conservation,  and  relative  change  in  the 

and  $p  indicates  the  use  of  Slotboom  variable  in  the  continuity  equa¬ 
tion 


magnitude  of  terminal  currents  and  of  terminal  charges. 
Further  discussions  are  deferred  to  Section  V. 

The  matrix  solutions  are  the  most  CPU  time  intensive 
steps  in  STRIDE.  The  incomplete  Cholesky  conjugate 
gradient  (ICCG)  algorithm  [11]  is  used  to  solve  the  sym¬ 
metric  matrices,  while  asymmetric  matrices  are  solved 
using  the  incomplete  LU  decomposition  conjugate  gra¬ 
dient  squared  (ILUCGS)  algorithm  [12].  The  parallel  im¬ 
plementation  of  these  algorithms,  which  are  based  on  do¬ 
main  decomposition,  are  described  in  [13]  and  [14].  The 
parallel  efficiency  achieved  by  these  algorithms,  while 
running  on  16  processors,  is  more  than  80%:  when  the 
problem  size  exceeds  50  000  nodes. 

The  maximum  number  of  grid  points  that  can  be  han¬ 
dled  by  STRIDE  on  the  16-node  iPSC/2  system  is  over 
100  000, 3  which  translates  to  a  cubic  grid  of  47  points  in 
each  dimension.  This  is  the  direct  result  of  not  using  the 
full-Newton  scheme  which  would  nearly  double  the  mem¬ 
ory  per  node.  CPU  time  per  bias  point  is  about  1.5  h  for 
a  70K  node  bipolar  example.  This  is  averaged  from  a  Ic 
versus  curve  with  Vce  =  5  V.  In  this  curve,  V ^  in¬ 
creases  from  0.4  to  1  V  in  0.1 -V  steps. 

III.  Vectorization  Schemes 

Vectorization  is  an  important  aspect  of  reducing  the  ex¬ 
ecution  time  of  the  program.  Since  a  majorit;  of  CPU 
time  is  spent  solving  matrices,  our  efforts  have  concen¬ 
trated  on  vectorizing  the  iterative  matrix  solvers. 

The  principle  behind  the  vectorization  is  to  group  to¬ 
gether  long  chains  of  repetitive  operations  which  are  mu¬ 
tually  independent.  This  independence  is  essential  so  that 
vector  processing  will  not  produce  different  results  from 
the  scalar  operations.  Thus  the  key  to  vectorization  is  to 
identify  such  groups  of  operations.  For  most  iterative  ma¬ 
trix  solution  algorithms,  most  of  the  operations  involve 
vector-vector  or  matrix-vector  products.  Although  the 
former  is  trivially  vectorized,  the  latter  takes  some  effort 
when  the  matrices  are  sparse.  A  matrix-vector  product 
can  be  considered  to  be  the  sum  of  many  vector-vector 
products  which  can  be  easily  vectorized.  This  works  well 
for  dense  matrices  which  have  long  rows.  However,  when 
the  matrix  is  sparse,  the  length  of  these  vectors  becomes 
very  short  (typically,  three  to  six)  which  seriously  impedes 
vector  processing  performance. 

One  approach  to  increase  the  vector  length  is  to  split 
the  matrices  into  many  small  dense  matrices  obtained  from 
the  elements  of  the  simulation  domain,  such  as  triangle 
elements  in  2-D  simulation  [15].  When  two  elements  con¬ 
tain  no  common  node,  their  matrices  are  independent  and 
can  be  grouped  together.  This  grouping  can  be  called  in¬ 
dependent  element  grouping. 

Building  upon  this  idea,  we  implemented  an  indepen¬ 
dent  edge  grouping  scheme.  In  terms  of  group  theory,  a 

Previously,  we  have  reported  a  parallel  efficiency  of  about  60%  when 
the  concurrent  ICCG  algorithm  ran  on  iPSC  The  improvement  in  effi¬ 
ciency  is  a  result  of  the  ten-fold  improvement  in  the  data  latency  for 
iPSC/2  than  iPSC. 

’The  maximum  grid  count  is  increased  to  more  than  250  000  on  the  new 
32-node  iPSC/860  system  with  16  Mbytes  per  node 


matrix  element  {Atj)  can  be  considered  as  an  edge  between 
the  row  node  (i)  and  the  column  node  (j).  When  two 
edges  contain  no  common  node,  the  matrix- vector  oper¬ 
ations  they  represent  are  independent  and  can  be  grouped 
together.  Due  to  the  above  restriction,  there  can  be  only 
one  edge  that  contains  a  particular  node  in  each  group. 
Thus  the  minimum  number  of  groups  is  the  maximum 
number  of  edges  a  node  has.  For  a  seven-point  stencil, 
this  number  is  six  (the  diagonal  elements  are  already 
grouped  together  in  the  sparse  matrix  data  structure).  The 
grouping  is  achieved  by  a  greedy  algorithm  searching 
through  the  edges  represented  in  the  sparse  matrix  point¬ 
ers.  So  far,  optimal  grouping,  in  the  sense  that  six  groups 
are  sufficient  to  cover  all  the  edges,  has  always  been 
achieved  with  our  present  ordering  schemes  of  the  nodes. 
The  average  size  of  the  groups,  which  equals  the  average 
vector  length,  is  about  half  the  number  of  nodes. 

When  compared  with  the  element  grouping,  the  edge 
grouping  has  the  advantage  of  not  requiring  extra  double 
precision  storage  and  extra  floating  point  operations.  Fur¬ 
thermore,  it  is  compatible  with  the  parallel  implementa¬ 
tion  of  the  matrix  solver  on  the  hype^ube  [14].  The  dis¬ 
advantage  is  that  more  indirect  addressing  is  needed  which 
slows  down  the  vector  operations.  This  disadvantage  is 
partially  alleviated  by  re-ordering  the  matrix  elements  so 
that  the  indirect  addressing  is  not  needed  to  access  the 
matrix.  This  change  has  resulted  in  a  25%  increase  in  CPU 
performance  for  Cray  YMP,  which  is  also  the  average 
improvement  seen  on  Convex  Cl. 

With  the  matrix-vector  product  vectorization  issue  re¬ 
solved  by  the  above  mentioned  schemes,  all  the  opera¬ 
tions  in  the  conjugate  gradient  algorithm  are  now  well 
vectorized.  Extra  efforts  are  needed  to  vectorize  the  op¬ 
erations  involving  the  IC  or  ILU  preconditioner  which  ac¬ 
count  for  more  than  one  third  of  the  total  operation  counts. 
A  well-known  scheme  is  to  color  order  the  nodes.  Col¬ 
oring  divides  the  nodes  into  several  groups  such  that  a 
node  will  not  be  in  the  same  group  with  any  nodes  that 
share  an  edge  with  it.  For  the  seven-point  stencil  finite- 
difference  scheme  currently  used  in  STRIDE,  only  two 
colors  are  necessary  and  the  ordering  scheme  is  called  red- 
black  ordering.  The  price  for  the  red-black  ordering  is  an 
increase  in  iteration  number.  From  our  experience,  the 
increase  in  the  iteration  numbers  is  about  30%  for  sym¬ 
metric  matrices  derived  from  uniform  grids  and  about 
double  for  asymmetric  matrices.  Still  the  advantages 
stemming  from  the  ability  to  fully  vectorize  the  entire  ma¬ 
trix  solver  operation  outweighs  its  penalties.  The  perfor¬ 
mance  of  the  algorithms  were  measured  in  terms  of  CPU 
time  per  linear  iteration.  When  measured  in  terms  of  CPU 
time  per  linear  iteration  per  variable,  the  raw  speed  ad¬ 
vantage  of  vector  over  scalar  for  ICCG  and  ILUCGS  type 
of  iterative  matrix  solvers  is  about  4.5  using  the  Convex 
Cl  and  9.5  using  the  Cray  YMP.4  Therefore,  even  with 

<The  veclonzation  on  iPSC/2  was  not  pursued  because  of  the  design  flaw 
in  vector  processing  unit  (VPU).  As  it  stands,  a  VPU  can  only  access  about 
one  eighth  of  the  total  memory  and  a  complete  vectorization  of  the  iterative 
solvers  would  entail  constant  data  swapping.  The  newly  available 
i860-based  systems  does  not  have  such  a  problem. 


the  worst-case  situation,  red-black  ordering  reduces  the 
computation  time  of  ICCG  and  ILUCGS  operations  by 
more  than  50%  on  Convex  Cl  and  more  than  80%  on 
Cray  YMP. 

When  implemented  on  Convex  Cl  and  Cray  YMP,  the 
iterative  matrix  solvers  in  STRIDE  are  able  to  run  at  2 
MFlops  on  the  Cl  and  100  MFlops  on  the  YMP. 

IV.  Acceleration  of  Two-Carrier  Gummel  Style 
Iterations 

Having  achieved  dramatic  improvement  in  the  conver¬ 
gence  performance  of  Gummel  style  nonlinear  scheme  at 
high  level  injection  using  a  MSP  scheme  [5],  our  attention 
turned  to  the  application  of  MSP  and  its  extensions  to 
Gummel  style  iterations  in  two-carrier  simulation.  We  will 
call  the  MSP  scheme  proposed  in  [5]  MSP- 1C,  with  1C 
added  for  one-carrier. 

For  completeness,  the  key  formula  for  MSP- 1C  is 
shown  in  the  following: 

(5) 

The  key  point  from  the  discussion  of  MSP- 1C  [5]  is  that 
in  the  n-type  region  where  the  charge  neutrality  prevails, 
(5)  is  quite  accurate  and  its  substitution  into  the  linearized 
continuity  equation  will  retain  much  of  the  coupling  be¬ 
tween  Poisson  and  continuity  equations,  thereby  improv¬ 
ing  the  convergence  performance  of  the  Gummel  style 
nonlinear  iteration  scheme. 

Two  simple  extensions  of  the  MSP- 1C  scheme,  which 
retain  the  advantage  of  low  computational  cost  per  itera¬ 
tion,  can  be  used  in  two-carrier  simulations.  One  is  to 
apply  MSP- 1C  to  the  “main”  carrier  equation,  such  as 
the  electron  continuity  equation  in  n-p-n  transistor  simu¬ 
lations.  The  other  is  to  use  MSP- 1C  separately  on  each 
continuity  equation.  The  advantage  of  these  extensions  is 
low  computational  cost  per  iteration.  For  both  cases,  the 
presence  of  the  other  carrier  is  ignored  as  far  as  MSP- 1C 
is  concerned.  Therefore,  dramatic  improvement  in  con¬ 
vergence  performance  is  not  to  be  expected.  Neverthe¬ 
less.  significant  improvement  has  been  observed  over  the 
traditional  Gummel’s  scheme  with  the  asymptotic  con¬ 
vergence  rate  for  these  schemes  ranging  from  four  to  six 
of  that  for  the  Gummel  iteration  in  the  high-level  injection 
regime.  However,  these  increasing  convergence  rates  are 
still  very  low  with  the  error  typically  halving  every  six  to 
seven  iterations. 

These  unsatisfactory  results  prompted  us  to  explore  new 
schemes.  Our  first  approach  was  to  use  a  “true”  exten¬ 
sion  of  MSP-1C,  the  MSP-2C  scheme.  The  key  formula 
for  this  MSP-2C  scheme  is  shown  as  follows: 

D^64>  +  D^6$n  +  D+4r6$p  =  -/>  (6) 

Comparing  (5)  with  (6),  the  terms  associated  with  changes 
in  both  carrier  variables  are  included,  thereby  the  name 
MSP-2C.  When  (6)  is  substituted  into  the  linearized  con¬ 
tinuity  equations  of  both  carriers,  we  obtain  a  matrix  with 
a  dimension  of  2  N  by  2  N.  This  matrix  can  be  expressed 
in  terms  of  the  original  matrices  as  follows: 


(7) 


As  the  size  of  AMSP.2C  indicates,  the  two  continuity  equa¬ 
tions  are  solved  together  with  the  MSP-2C  while  Pois¬ 
son’s  equation  is  still  solved  separately.  Solution  of  larger 
matrices  results  in  an  increase  in  both  the  data  storage  and 
CPU  time  per  iteration.  This  is  the  major  factor  that  re¬ 
duces  the  maximum  simulatable  node  count  from  that  pre¬ 
viously  reported  to  130  K  to  about  100  K.  CPU  time  per 
iteration  has  been  observed  to  roughly  double. 

Somewhat  to  our  surprise,  the  incorporation  of  MSP-2C 
does  little  to  improve  the  convergence  of  normal  transis¬ 
tor  simulations  beyond  what  has  been  achieved  by  the 
MSP-1C  scheme,  although  in  the  latch-up  analysis,  the 
convergence  behavior  of  the  latched  device  improves  dra¬ 
matically  with  the  error  at  least  halving  for  every  itera¬ 
tion. 

In  order  to  stay  within  the  memory  requirement  of 
MSP-2C  scheme  instead  of  going  full  Newton  and  a  dou¬ 
ble  of  memory  requirement,  we  next  turned  an  algorithm 
which  solves  Poisson  equation  with  the  “main”  carrier 
equation  such  as  the  electron  equation  in  a  n-p-n  transis¬ 
tor,  the  Newton-lC  scheme.5  An  additional  motivation  for 
using  the  Newton- 1C  scheme  was  the  observation  that 
throughout  the  simulation  of  normal  bipolar  transistor  op¬ 
eration,  the  coupling  between  Poisson  and  the  “minor” 
carrier  equation  remained  very  weak6  even  though  the  de¬ 
vice  itself  had  gone  into  the  strong  high-level  injection 
regime. 

Fig.  1  shows  the  convergence  results  for  Gummel, 
MSP- 1C  and  Newton-lC  schemes  for  simulations  done 
on  a  bipolar  transistor.  VCB  is  fixed  at  5  V.  The  number 
of  nodes  is  about  13  700  and  the  simulations  are  executed 
on  eight  processors  with  an  estimated  parallel  efficiency 
of  72 % .  As  shown  in  Fig.  1 ,  at  the  highest  injection  level, 
MSP- 1C  is  about  three  times  faster  than  Gummel,  while 
Newton-lC  is  still  three  times  faster  than  MSP-1C,  de¬ 
spite  the  doubling  in  CPU  time  per  iteration.  Although 
the  full  Newton  scheme  is  not  yet  available  from  STRIDE, 
Newton-lC  is  expected  to  be  faster  than  the  full  Newton 
scheme  since  CPU  time  per  iteration  for  the  full  Newton 
is  expected  to  be  twice  of  that  for  Newton-lC.  For  in¬ 
stance,  for  the  test  example  in  the  next  section,  a  total  of 
eighteen  iterations  are  needed  for  convergence,  which  is 
CPU  equivalent  to  less  than  eight  full  Newton  iterations. 
Given  the  severity  of  the  test  example,  it  is  very  unlikely 
that  the  full  Newton  scheme  can  converge  in  less  than 
eight  iterations. 

It  should  be  noted  that  the  kind  of  matrix  solvers  used 
in  a  device  solver  affects  the  results  obtained  for  using  the 


’The  Newton-lC  scheme  was  used  in  some  early  works  on  device  sim¬ 
ulation.  such  as  an  early  version  MINIMOS  and  Dr.  J.  W.  Slotboom's 
initial  work  on  2-D  simulation  some  15  a  ago. 

‘This  can  be  ascertained  by  noticing  that  the  error  of  the  other  continuity 
equation  is  several  orders  below  that  of  the  main  continuity  equation. 
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Fig.  1.  Two-carrier  convergence  results  of  various  schemes. 

MSP  and  Newton-lC  schemes.  When  a  device  solver  uses 
iterative  device  solvers,  forward  elimination  and  backsub- 
stitution  as  well  as  matrix-vector  multiplication  are  the 
most  CPU  intensive  operations.  The  cost  of  these  opera¬ 
tions  grows  as  the  square  of  the  number  of  variables  per 
node.  In  a  device  solver  using  a  direct  matrix  solver,  how¬ 
ever,  the  cost  of  matrix  factorization  usually  dominates 
CPU  time  and  it  grows  as  the  cube  of  the  number  of 
variables  per  node.  Therefore,  the  proper  use  of  the  MSP 
and  Newton-lC  schemes  in  such  a  device  solver  are  ex¬ 
pected  to  yield  an  even  more  favorable  result  in  compar¬ 
ison  with  the  use  of  the  full  Newton  scheme. 

Given  the  results  of  these  schemes,  one  might  ponder 
the  reason  behind  the  relative  success  of  MSP  in  one-car¬ 
rier  simulation  and  its  relative  ineffectiveness  in  two-car¬ 
rier  simulations  even  though  the  minor  carrier  equation  is 
only  weakly  coupled  to  the  Poisson  equation.  The  key 
formula  of  the  MSP  scheme,  (5)  and  (6),  relates  the  rel¬ 
evant  carrier  concentration  with  the  value  of  \p  at  the  high 
carrier  concentration  nodes.  Therefore,  MSP  performs 
best  when  the  high-level  injection  only  causes  the  local 
coupling  between  Poisson  and  continuity  equation(s),  that 
is,  when  the  value  of  yp  at  a  node  is  the  dominant  factor 
in  determining  the  carrier  concentration  at  that  node.  This 
is  the  situation  in  the  inversion  layer  of  MOSFET’s  where 
the  conduction  charge  is  induced  electrostatically.  The 
situation  for  the  two-carrier  simulation  is  very  different. 
High-level  injection  of  a  bipolar  transistor  almost  always 
accompanies  the  Kirk  effect  [16],  i.e.,  the  base  push-out 
effect.  When  the  Kirk  effect  occurs,  the  carrier  concentra¬ 
tion  in  the  lightly  doped  collector  region  is  determined  not 
by  the  local  values  of  \p,  but  rather  the  amount  of  the  col¬ 
lector  current  that  needs  to  be  sustained.  This  is  in  turn 
determined  by  the  injection  level  at  the  base-emitter  junc¬ 
tion.  On  the  other  hand,  the  amount  of  carrier  concentra¬ 
tion  also  significantly  impacts  the  ^  distribution  in  the 
lightly  doped  collector  region.  Therefore,  a  nonlocal  cou¬ 
pling  exists  between  the  Poisson  and  continuity  equa¬ 
tions.  Since  MSP  schemes  are  only  capable  to  take  into 
account  the  local  coupling  between  Poisson  and  continu¬ 
ity  equation(s),  they  are  relatively  ineffective  in  improv¬ 
ing  the  convergence  of  Gummel  style  iterations  in  normal 
bipolar  simulations.  On  the  other  hand,  since  it  takes  into 
account  the  complete  coupling  between  Poisson  and  the 


main  carrier  equation  by  solving  them  together,  the  New¬ 
ton- 1C  scheme  is  able  to  achieve  fulI-Newton  competitive 
convergence  performances,  given  that  the  minor  carrier 
equation  is  virtually  independent  of  the  Poisson  equation. 

The  fact  that  Newton- 1C  is  more  expensive  than 
MSP- 1C  in  CPU  time  per  iteration  presents  an  issue  of 
when  to  switch  from  Gummel  or  MSP- 1C  schemes  to 
Newton- 1C  in  order  to  optimize  the  overall  solution  time. 
During  the  implementation  of  MSP- 1C  scheme,  we  found 
an  interesting  parameter  which  has  important  bearing  on 
this  issue.  Excluding  the  recombination  terms,  the  dis¬ 
crete  current  continuity  equation  at  node  i  consists  of  the 
sum  of  all  the  current  components  flowing  into  the  node. 
For  an  electron  current  component  which  flows  into  node 
i  along  the  edge  connecting  the  nodes  i  and  j(In  iJ),  there 
are  contributions  to  diagonal  (dln  ij/d$„(i))  and 

off-diagonal  (d/rt.1J/d4>„(j)).  From  the  point  of 

view  of  matrix  formulation,  the  application  of  MSP- 1C 
scheme  means  a  modification  to  the  original  matrix  of  the 
current  equation  (i.e.,  modified  by  —-4*,* 

•  DD^D^  for  electron  equation  [5]).  For  /„  ,,,  the 
modification  to  is  —  (dl„  y / dip(i ))D ^ (i )D^(i' ) . 

A  detailed  analysis  reveals  that,  if  4>„(/j  <  4>„(y ),  then 
the  modification  strengthens  A*„*„(/7),  i.e.,  it  has  the  same 
sign  with  /4+„«,„(i/);  otherwise,  A^in(ii)  is  weakened  by 
the  modification.  This  diagonal  weakening  merits  special 
attention,  since,  according  to  our  experience,  all  the  di¬ 
agonals  in  a  matrix  must  have  the  same  sign  for  the  iter¬ 
ative  matrix  solvers  to  converge.  It  turns  out  that  the  ratio 
(0nl))  between  the  modification  (~(dl„(ij)/ 
d\l/(i))D^(i)D^jii))  and  the  original  term  ( dl„(ij )/ 
34 >„(/))  is  never  less  than  minus  one.  This  ensures  the  di¬ 
agonals  of  will  not  become  negative  as  the  result  of 
MSP  modifications.  Furthermore,  the  most  negative  of  all 
the  8„(ij)  can  be  used  as  a  parameter,  indicating  the  de¬ 
gree  to  which  the  original  has  been  modified.7  Our 
experience  indicates  that  when  the  magnitude  of  this  pa¬ 
rameter  is  small,  the  Gummel  iteration  is  just  as  effective 
as  any  other  nonlinear  scheme.  When  the  magnitude  ap¬ 
proaches  to  one,  however,  the  Gummel  iteration  becomes 
very  slow  and  a  more  elaborate  scheme  has  to  be  em¬ 
ployed  to  accelerate  the  convergence.  Therefore,  this  pa¬ 
rameter  is  also  a  very  good  indication  of  the  strength  of 
the  coupling  between  Poisson's  equation  and  the  conti¬ 
nuity  equation.  A  scheme  is  thereby  implemented  in 
STRIDE  to  use  Newton-lC,  if  the  user  desires  to  do  so, 
when  the  magnitude  of  this  matrix  parameter  is  large 
enough.  Currently,  the  threshold  value  is  0.25.  The  low 
bias  results  of  Newton- 1C  curve  in  Fig.  1  were  actually 
obtained  with  Gummel  or  MSP- 1C  schemes.  By  the  same 
token,  this  switch  scheme  can  also  be  extended  to  use  the 
full  Newton  scheme  when  it  is  truly  necessary.  In  short, 
the  introduction  of  this  matrix  parameter  makes  it  feasible 
for  the  program  to  automatically  choose  the  best  nonlinear 
scheme  at  its  disposal.  Although  originated  in  the  context 

'Although  an  electron  equation  is  used  as  the  example,  a  similar  param¬ 
eter  can  also  be  calculated  for  the  hole  equation. 


of  the  finite  difference  approach,  this  parameter  can  also 
be  calculated  in  device  solvers  using  the  finite  element 
approach  with  only  minimum  overhead. 

V.  Convergence  Criterion  and  Damping  Schemes 

Traditionally,  the  convergence  criterion  for  the  carrier 
variables  have  been  that  the  maximum  relative  change  of 
all  the  variables  is  below  a  certain  value.8  In  contrast  to 
this  relative  criterion,  there  is  also  the  absolute  criterion 
which  measures  the  convergence  of  residual  of  the  differ¬ 
ence  equations.  Although  the  absolute  criterion  is  useful 
in  a  global  damping  scheme  [17],  its  application  as  a  con¬ 
vergence  criterion  is  less  useful.  For  example,  the  resid¬ 
ual  of  the  current  continuity  equation  which  can  be  tol¬ 
erated  depends  heavily  on  the  amount  of  current  flowing 
through  the  simulating  device,  which  can  differ  by  many 
orders  of  magnitude.  It  is,  therefore,  not  always  possible 
to  determine  a  priori  what  is  the  appropriate  tolerance 
level  for  the  residual.  Another  important  criteria  that  has 
not  been  widely  used  is  the  convergence  of  the  terminal 
currents,  which  also  entails  the  conservation  of  all  the  ter¬ 
minal  currents.  When  we  monitor  the  convergence  of  both 
the  currents  and  the  variables,  we  found  that  in  the  low 
current  regime,  the  variables  may  converge  before  the 
currents  do.  Instances  were  observed  in  which  the  cur¬ 
rents  do  not  converge  until  the  maximum  relative  change 
in  carrier  variable  fell  below  1CT9  or  even  lower.  By  com¬ 
bining  these  two  convergence  criteria,  we  are  able  to  re¬ 
duce  the  lower  limit  above  which  the  currents  can  be  cal¬ 
culated  self-consistently.  On  the  other  hand,  at  high 
current  levels,  variables  may  lag  far  behind  the  currents 
in  convergence.  In  this  regard,  we  observed  instances  in 
which  the  convergence  of  current  was  achieved  before  the 
maximum  relative  change  in  the  carrier  variable  reached 
below  10“:.  We  feel  that  it  is  of  not  much  use  to  calculate 
the  variables  to  five  or  six  digits  of  accuracy  when  the 
currents  have  converged.  Based  on  the  previous  obser¬ 
vations.  we  have  chosen  a  combination  of  a  relatively 
loose  variable  convergence  criterion  about  10  with  a 
current  convergence  criterion  of  about  10-4  which  has 
worked  quite  well  for  us  so  far  regardless  of  current  lev¬ 
els. 

A  more  serious  issue  is  that  for  a  device  simulator  using 
iterative  matrix  solvers,  the  absolute  convergence  of  car¬ 
rier  variables  becomes  much  more  difficult  since  matrices 
are  usually  solved  to  a  less  accurate  extent  than  direct 
solvers  are.  In  fact,  we  found  that  when  the  traditional 
Slotboom  variables  are  used,  the  convergence  of  these 
variables  become  almost  impossible  in  typical  two-carrier 
simulations.  When  the  scaled  Slotboom  variables  are 
used,  however,  more  often  than  not.  this  apparent  non¬ 
convergence  of  the  concentration  variables  does  not 
impede  the  convergence  of  current  which  is  also  accom¬ 
panied  by  the  convergence  in  the  errors  in  the  continuity 
equations.  Therefore,  we  have  encountered  situations  in 

*The  convergence  criteria  for  ^  usually  measures  its  maximum  absolute 
change  since  it  is  well  scaled  (itf  «  I  means  the  solution  is  very  close.) 


which  the  solutions  have  converged  for  all  practical  pur¬ 
poses,  but  were  not  recognized  by  the  traditional  conver¬ 
gence  criterion  for  the  concentration  variables.  Therefore, 
the  question  is  whether  we  can  find  a  better  criterion  which 
can  tell  when  a  nonconvergence  of  carrier  variables  is  for 
real.  The  clue  again  lies  in  (S)  and  (6).  In  the  first-order 
approximation,  the  terms  like  and  -D^ 

*  represent  the  changes  in  \p  necessary  to  restore  the 
charge  balance  required  by  Poisson’s  equation.  If  these 
changes  are  very  small  after  the  solution  of  carrier  vari¬ 
ables,  then  there  are  virtually  no  changes  necessary  in  \pw 
and  we  have  a  converged  solution.  A  detailed  look  at  these 
terms  reveals  that  they  are  the  relative  changes  in  carrier 
variables  multiplied  by  a  weighting  vector  whose  values 
are  no  larger  than  unity.  Therefore,  the  convergence  cri¬ 
terion  for  the  carrier  variables  can  be  these  weighted  rel¬ 
ative  changes.  The  weighting  coefficients  are  very  small 
for  the  minority  carrier,  and  approach  one  for  the  majority 
carrier  in  the  charge  neutral  region.  In  essence,  this  cri¬ 
terion  discounts  the  changes  in  minority  carrier  concen¬ 
tration  as  long  as  these  changes  do  not  significantly  per¬ 
turb  Poisson’s  equation  nor  impede  the  convergence  of 
terminal  currents.  There  is  a  very  good  correlation  be¬ 
tween  the  new  measures  and  the  maximum  error  in  the 
continuity  equations  and  we  are  able  to  distinguish  the 
practically  converged  solutions  from  the  apparently  non¬ 
converging  solutions  according  to  the  old  measures.  The 
data  storage  and  computational  cost  of  the  new  measures 
are  minimum  since  only  vector  (not  matrix)  operations  are 
involved. 

One  of  the  challenging  issues  of  the  nonlinear  iteration 
schemes  is  how  to  choose  a  robust  damping  scheme  to 
ensure  global  convergence.  The  schemes  used  in  STRIDE 
for  Gummel  and  MSP  schemes  have  been  discussed  in 
[5].  For  the  MSP-2C  scheme,  both  and 

D^p  are  considered  and  different  limiting  values  are 
used.  The  problem  for  the  Newton- 1C  scheme  is  more 
difficult  since  we  have  two  variables  with  vastly  different 
ranges  and  a  change  in  \p  impacts  the  carrier  concentration 
exponentially .  Our  first  attempt  was  to  try  various  resid¬ 
ual  limiting  schemes  such  as  suggested  in  [17].  Applica¬ 
tions  based  on  their  norm  reduction  principles  have  proven 
to  be  quite  successful  in  nonlinear  iteration  of  Poisson’s 
equation  [5].  The  results  have  not  been  consistent,  how¬ 
ever.  The  key  difficulty  here  is  how  to  weigh  the  residuals 
from  the  Poisson  and  continuity  equations. 

Our  recent  attempts  are  based  on  a  different  principle. 
Since  \p  impacts  carrier  concentration  exponentially,  the 
matrix  will  change  dramatically  for  a  large  change  in  ip. 
Therefore,  the  changes  in  \p  should  be  restricted  so  as  not 
to  cause  large  changes  in  carrier  concentration  which 
would  greatly  upset  the  charge  balance  in  the  previous 
solution.  Similarly,  the  changes  in  the  carrier  variable 
should  be  restricted  to  require  only  a  modest  change  in  \p 
to  restore  the  charge  balance.  In  essence,  our  scheme  is  a 
trusted  region  approach,  widely  used  in  the  nonlinear  op¬ 
timization  community,  with  the  trusted  region  determined 
based  on  the  specific  knowledge  of  semiconductor  equa- 


Fig.  2.  Global  damping  scheme:  evolution  of  potential  update. 


tions.  In  the  actual  implementation,  a  damping  coefficient 
a  is  first  calculated  based  upon  the  above  principles,  and 
the  errors  in  Poisson  and  continuity  equation  are  re-eval- 
uated  after  the  damped  variable  update  to  monitor  the 
change  in  these  errors.  The  use  of  a  damped  update  on 
the  first  try  prevents  the  problem  of  machine  overflow 
which  may  occur  with  a  full  update  when  5\p  become  too 
large.  It  is  rare  that  a  such  calculated  a  still  produces  an 
update  which  causes  a  dramatic  increase  in  these  errors. 
When  this  occurs,  a  is  further  reduced  to  brighten  down 
these  increases.  Our  experience  shows  that  it  is  essential 
not  to  insist  on  the  reduction  of  residuals  as  long  as  their 
increase  is  moderate.  Otherwise,  excessive  damping  may 
occur  which  slows  the  convergence  process  to  a  crawl. 
Fig.  2  shows  the  evolution  of  the  magnitude  of  ip  update 
generated  by  the  Newton- 1C  scheme  during  a  n-p-n  bi¬ 
polar  transistor  simulation  for  an  initial  bias  of  VCE  =  5 
V  and  Kbe  =  IV.  This  is  a  very  severe  test  example  for 
the  global  damping  scheme  in  that  a  bipolar  transistor  is 
biased  into  heavy  high-level  injection  regime  in  one  step. 
Fig.  3  shows  the  values  of  the  damping  coefficient  a  used, 
while  Fig.  4  shows  the  errors  in  the  Poisson  and  conti¬ 
nuity  equations.  Both  errors  are  normalized  by  their  re¬ 
spective  starting  errors  at  iteration  7.  The  number  of  it¬ 
eration  starts  at  7  since  Newton- 1C  was  first  engaged  at 
this  iteration  after  solution  was  settled  down  somewhat. 
Although  the  initial  (normalized)  \p  update  is  more  than 
100,  which  translated  to  about  3  V,  it  goes  down  to  less 
than  one  (about  26  mV)  after  just  seven  iterations.  After- 


Fig.  4.  Global  damping  scheme:  evolution  of  the  residuals. 


wards,  the  solution  converges  superiinearly  as  can  be  seen 
from  the  evolution  of  Poisson  errors. 

VI.  User  Interfaces  and  Tool  Integration 

To  provide  easier  use  of  STRIDE,  we  have  created  a 
utility  program  that  encapsulates  the  details  of  the  pro¬ 
gram  and  its  algorithms  from  the  user.  The  program,  based 
on  the  PISCES-IIB  utility,  BIPMESH,  provides  auto¬ 
mated  input  deck  generation,  which  includes  mesh, 
model,  and  bias  information.  The  program  asks  the  user 
questions  about  structure  geometry,  doping  profiles,  and 
bias  conditions.  From  this  input,  mesh  is  automatically 
generated  using  heuristics  based  on  our  experience  with 
PISCES-IIB  simulation.  Doping  profiles  may  be  either 
analytic  functions,  SUPREM-IV  profiles,  or  SUPREM- 
III  profiles.  The  extension  in  the  other  dimension(s)  for 
SUPREM-IV(III)  profiles  is(are)  based  on  an  error  func¬ 
tion  approximations  in  the  other  dimension(s). 

The  use  of  automated  input  deck  generation  facilitates 
integration  of  STRIDE  into  an  integrated  simulation  sys¬ 
tem  for  TCAD.  STRIDE  will  be  included  as  part  of  a  suite 
of  device  simulation  tools  in  an  integrated  simulation  en¬ 
vironment  based  upon  SIMPL-IPX  [18]. 

To  help  the  user  interpret  the  output  of  STRIDE,  a  link 
to  Visualization  tools  such  as  NCSA  Xlmage  [19]  and 
Spyglass  Dicer  are  provided.  By  mapping  solution  values 
to  color,  one  is  able  to  see  spatial  variations  of  the  solu¬ 
tion  variable  much  more  rapidly.  In  addition  one  can  an¬ 
imate  sequences  of  frames,  useful  for  simulating  transient 
effects  and  for  looking  at  dc  sweeps.  The  translators  from 
STRIDE  data  format  to  the  format  of  these  Visualization 
tools  operate  as  stand-alone  utilities. 

VII.  3-D  Latch-Up  Analysis  with  STRIDE 

To  fully  appreciate  the  benefits  of  3-D  device  simula¬ 
tion,  the  relation  of  layout  to  latch-up  conditions  were  in¬ 
vestigated.  Many  studies  in  2  D,  such  as  [20]  and  [21], 
have  shown  that  latch-up  trigger  current  is  lower  than  the 
expected  value  based  on  circuit  models  [22]  due  to  de¬ 
biasing  under  the  P+  contact  as  shown  in  Fig.  5.  The 
current  flow  in  the  n-well  under  the  P  +  induces  a  voltage 
drop.  This  voltage  drop  in  tum  forward  biases  the 
P+ /n-well  junction,  resulting  in  injected  current  into  the 


Fig.  S.  Structure  for  2-D  latch-up  analysis. 


substrate.  The  injected  current  into  the  substrate  debiases 
the  injector  contact,  resulting  in  sustained  latch-up.  In 
terms  of  the  circuit  model,  lateral  injection  causes  the  ver¬ 
tical  p-n-p  to  conduct,  which  in  tum  increases  the  biases 
on  the  lateral  n-p-n. 

The  trigger  current  calculated  from  the  2-D  structure 
can  be  judged  as  very  conservative  when  compared  to  the 
case  seen  in  an  actual  layout  of  an  inverter  as  shown  in 
Fig.  6,  where  all  the  injected  current  does  not  flow  under 
the  P+.  Furthermore,  the  N+  S/D  can  face  either  a 
N-well  contact  (case  B)  as  shown  or  a  P+  S/D  (Case  A). 
For  most  standard  cell  layout,  case  B  is  more  typical. 

The  simulation  structure  for  3  D  contains  about  10  000 
nodes  with  dimensions  of  10  /im  in  the  x-direction,  10  fim 
in  the  y-direction,  and  7  pm  in  the  z-direction.  A  buried 
layer  is  used  on  the  bottom  of  the  structure.  Four  contacts 
are  used— three  on  the  surface,  with  two  different  injector 
contact  positions  as  shown  in  Figs.  7  and  8,  and  one  on 
the  bottom  acting  as  a  substrate  contact.  Doping  profiles 
are  analytic  functions  for  simplicity.  The  n-well  and  P+ 
contact  inside  the  n-well  are  held  at  2-V  potential,  while 
the  substrate  contact  is  held  at  0  V.  The  voltage  at  the 
N  +  injector  is  negatively  biased  until  the  current  at  the 
node  dramatically  increases  and  the  carriers  flood  the  de¬ 
vice — indicating  a  latch-up  condition.  Using  MSP-2C,  the 
average  time  to  simulate  the  trigger  current  is  about  6  h 
on  the  16  node  hypercube,  corresponding  to  the  use  of 
four  bias  points.  The  output  data  from  STRIDE  are  post- 
processed  into  a  format  which  is  readable  by  the  visual¬ 
ization  tool  Spyglass  Dicer. 

The  positioning  of  the  injector  contact  in  the  substrate 
is  very  important  in  determining  the  value  of  trigger  cur¬ 
rent  of  this  structure.  Based  on  the  debiasing  mechanism 


Fig.  7.  3-D  latchup  analysis  structure:  case  A. 


Fig.  8.  3-D  latchup  analysis  structure:  case  B. 


Fig.  9.  3-D  plot  of  voltage  drop  at  latchup  onset:  case  A. 


for  the  2  D  discussed  above,  it  would  seem  plausible  that 
the  trigger  current  from  case  A  would  be  lower  than  that 
from  case  B  since  more  of  the  injected  current  would  tend 
to  flow  under  the  P+  contact  for  case  A.  STRIDE  simu¬ 
lations  do  indeed  confirm  this,  as  seen  from  Figs.  9  and 
10.  These  3-D  pictures,  taken  from  Spyglass  Dicer,  show 
a  voltage  drop  minus  its  equilibrium  value)  as  a  func¬ 
tion  of  the  position  prior  to  latch  up  (injected  currents 
(Table  I)  are  different).  Differences  in  shading  corre¬ 
sponds  to  gradients  in  voltage  drop. 

The  main  difference  between  the  gray  scale  figures  can 
been  seen  at  the  N  +  well  contact  (top  left  comer  of  each 
of  the  images).  It  is  not  possible  to  examine  the  voltage 
drops  at  the  P+  contact  in  the  N-well  since  at  the  onset 
of  latch-up  the  amount  of  debiasing  is  similar  due  to  the 
exponential  voltage  dependence  of  current.  The  point  to 
remen  ber  is  that  the  device  will  latch  up  at  a  particular 
current  flow  in  the  well.  For  case  A,  the  region  around  the 


TABLE i 

Latch-Up  Trigger  Current 


Trigger  Current  (mA) 


Ci se  A  Case  B 


0.244  0.482 


N  +  contact  in  the  direction  away  from  the  injector  is  dark, 
corresponding  to  an  asymmetric  current  flow  into  the  con¬ 
tact.  This  current  is  mostly  traveling  through  the  well  in 
the  direction  of  the  P+  contact.  In  contrast,  case  B  shows 
uniform  shading  around  the  N+  contact,  showing  that  the 
injected  current  is  being  effectively  collected  by  the  both 
sides  of  the  N  +  contact.  Thus  case  B  should  be  collecting 
more  total  injected  current  at  the  onset  of  latchup.  This 
finding  is  confirmed  in  Table  I. 

VIII.  Conclusions 

In  this  paper,  we  have  reported  progress  in  3-D  device 
simulation,  focusing  on  computational  aspects.  By  exclu¬ 
sively  using  iterative  matrix  solvers  and  insisting  on  low 
memory  usage  nonlinear  iteration  schemes,  approxi¬ 
mately  100  000  nodes  can  be  solved  on  a  user  memory  of 
about  100  M  bytes.  ICCG  and  ILUCGS  types  of  iterative 
solvers  are  efficiently  vectorized  by  using  both  the  inde¬ 
pendent  edge  grouping  scheme  and  the  red-black  order¬ 
ing.  Various  MSP  schemes  are  explored  to  improve  the 
convergence  of  two  carrier  simulation  using  Gummel  style 
nonlinear  schemes.  They  not  only  offer  significant  im¬ 
provement  by  themselves,  but  also  provide  the  insight 
leading  to  the  automatic  selection  of  nonlinear  schemes 
which  offers  the  best  CPU  performance.  Issues  of  the  con¬ 
vergence  criterion  and  global  convergence  scheme  have 
also  been  successfully  addressed  with  the  insight  provided 
by  MSP  schemes.  A  better  user  interface  has  been  devel¬ 
oped  to  facilitate  program  usage  and  tool  integration. 
Aided  with  graphics  capabilities  made  available  through 
various  graphices  tool  links,  layout  dependence  of  latch¬ 
up  trigger  current  has  been  successfully  analyzed. 
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Abstract 


This  paper  presents  a  methodology  for  adapting  PDE  solvers  for  parallel  execution  based  upon 
the  single-program-multiple-data  programming  paradigm.  Our  approach  minimizes  changes  to 
existing  code  and  data  structures,  thereby  preserving  the  value  captured  within  “dusty-deck”  pro¬ 
grams.  The  resulting  parallel  application  is  easily  portable  to  most  message-passing  distributed- 
memory  architectures.  To  demonstrate  the  viability  of  our  methodology,  the  commercially-avail- 
able,  semiconductor  modelling  program,  PISCES,  has  been  adapted  for  parallel  execution.  Our 
experience  shows  that  all  non-local  references  can  be  resolved  through  careful  choreography 
without  extensive  modifications  to  the  original  code.  The  parallel  simulator  currently  runs  on  the 
Intel  iPSC/860  and  the  Thinking  Machines  CM-5.  Simulating  realistic  complex  device  structures, 
we  have  achieved  remarkable  performance  gains  over  high-performance  serial  workstations.  We 
also  demonstrate  the  ability,  due  to  the  scalability  of  the  parallel  simulator,  to  simulate  structures 
too  large  for  our  existing  serial  computers.  This  simulation  capability  could  provide  immeasur¬ 
able  benefits  in  the  competitive  semiconductor  industry. 


1.  Introduction 

As  time-to-market  for  integrated  circuits  diminishes  and  manufacturing  technology 
becomes  increasingly  complex  and  costly,  semiconductor  device  simulators  become  crucial  in 
quantifying  the  electrical  behavior  of  devices.  Simulators  are  now  used  to  perform  not  only  design 
verification  but  also  manufacturability  and  scalability  studies.  The  interplay  between  adjacent 
devices  as  they  are  scaled  to  submicron  sizes  becomes  more  important  and  can  dramatically 
increase  the  memory  and  execution  time  requirements  of  simulations.  These  two  trends  mean 
device  simulators  must  realize  significant  performance  gains  to  provide  reasonable  turnaround  for 
device  and  process  designers.  The  use  of  scalable,  high-performance  parallel  architectures  to 
deliver  that  performance  will  become  strategically  important  in  the  continued  success  of  large- 
scale  device  simulation. 

In  recent  years,  distributed-memory  parallel  computers  built  with  high-performance 
microprocessors  and  supporting  the  message  passing  paradigm  have  become  the  standard  com¬ 
mercial  parallel  architecture.  Most  commercial  vendors  (e.g.,  IBM,  Cray,  Intel,  and  Thinking 
Machines)  produce  machines  of  this  type.  Likewise,  the  single-program-multiple-data  (SPMD) 
parallel  programming  model  [1]  has  become  a  popular  paradigm  for  developing  parallel  applica¬ 
tions  for  this  class  of  machines.  Emergence  of  a  single  hardware  abstraction  and  its  requisite  pro¬ 
gramming  model  should  portend  the  migration  of  parallel  machines  from  research  laboratories  to 
industrial  production  environments.  Unfortunately,  one  key  piece  of  the  puzzle  is  missing:  indus¬ 
trial  applications.  Although  many  researchers  have  developed  codes  for  parallel  architectures, 
commercial  software  development  has  been  slow  because  the  programming  model  required  to 
take  advantage  of  these  architectures  is  a  radical  departure  from  traditional  paradigms.  Addition¬ 
ally,  most  commercial  users  are  unwilling  to  discard  the  knowledge  and  expertise  captured  by 
their  existing  “dusty-deck”  programs  in  exchange  for  a  faster  yet  unproved  and  unfamiliar  parallel 
code.  One  way  to  bootstrap  parallel  application  development  is  to  port  existing,  well  accepted 
industrial  applications  to  parallel  machines  while  retaining  the  original  code  and  data  structures 
without  major  modifications.  A  method  of  parallelization  that  preserves  knowledge  inherent  in 
“dusty-decks”  pleases  users  by  maintaining  a  familiar  environment  while  providing  an  improved 
solution  capability.  This  approach  also  aids  parallel  software  and  hardware  designers  by  rapidly 
exposing  issues  involved  with  running  full-scale  industrial  applications  on  parallel  machines.  In 
this  paper  we  present  a  methodology  for  creating  parallel  grid-based  partial  differential  equation 
(PDE)  solvers  from  serial  ones  with  minimal  changes  to  the  existing  code  and  data  structures.  Our 
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method  is  based  upon  the  SPMD  programming  model  and  is  easily  portable  to  most  message¬ 
passing  distributed-memory  machines.  To  verify  our  method  and  to  illustrate  the  potential  for  pro¬ 
viding  vastly  improved  “dusty -deck”  performance,  we  have  adapted  a  commercially-supported 
and  industry-standard  device  simulator,  PISCES  [2,3],  for  parallel  execution.  The  code  currently 
runs  on  two  distributed-memory  architectures:  the  Intel  iPS C/860  and  the  Thinking  Machines 
CM-5. 

An  outline  of  this  paper  follows:  After  a  brief  description  of  the  serial  semiconductor 
device  simulator,  we  present  in  Section  3  our  methodology  for  adapting  PDE  solvers  for  parallel 
execution  accompanied  by  a  discussion  of  our  experiences  adapting  PISCES  for  parallel  execu¬ 
tion.  In  Section  4,  simulation  results  from  a  series  of  increasingly  complex  grids  defining  a  large, 
multi-device  structure  illustrate  the  utility  of  the  parallel  application.  Finally,  conclusions  are 
drawn  in  Section  5. 


2.  Overview  of  PISCES 

PISCES  is  a  large  and  complex  application  which  solves  problems  strategic  to  the  semi¬ 
conductor  industry.  The  code  encompasses  many  areas  of  research  in  both  physics  and  computa¬ 
tional  mathematics.  It  is  a  two-dimensional,  two-carrier  semiconductor  device  modeling  program 
developed  primarily  at  Stanford  University  over  the  past  fifteen  years  and  is  well  known  through¬ 
out  the  semiconductor  industry.  The  program  is  available  from  Stanford  or  from  one  of  several 
Technology  CAD  vendors  who  support  the  code  commercially.  To  date,  there  are  more  than  one 
thousand  industrial  and  academic  users.  PISCES  solves  Poisson’s  equation  and  the  semiconductor 
continuity  equations  below: 


V  (eV'P)  =  -q(p-n  +  trD-N-A)  -pF 
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Poisson’s  equation 
Electron  Contuniuty 
Hole  Continuity 


is  the  electric  potential,  and  n  and  p  are  the  election  and  hole  concentrations.  and  NA  are 
the  ionized  impurity  concentrations  and  pF  is  the  fixed-charge  density.  Jn  and  J p  are  the  electron 
and  hole  current  densities  and  Un  and  Up  are  the  electron  and  hole  recombination  rates. 

These  coupled  nonlinear  PDEs  are  discretized  using  a  finite  volume  formulation  on  a  non¬ 
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uniform  triangular  grid.  The  resulting  algebraic  equations  are  solved  using  a  nonlinear  iteration 
method.  The  coarse-grain  structure  of  the  simulator  is  shown  in  Figure  1.  This  program  structure 
is  typical  of  many  grid-based  nonlinear  PDE  solvers.  The  user  interface  (UI)  parses  input  files, 
performs  file  I/O  functions,  and  displays  simulation  results  via  plotting  utilities.  The  nonlinear 
solver  supports  a  wide  range  of  solution  techniques  and  contains  both  a  fully-coupled  Newton  [4] 
solver  and  staggered  Gummel  [5]  scheme.  PISCES  offers  a  large  and  diverse  set  of  physical  mod¬ 
els.  Improvements  and  additions  to  the  physical  models  are  the  most  active  areas  in  the  continuing 
development  of  the  simulator.  During  the  nonlinear  solution  process,  a  sparse  linear  system  is 
assembled  in  a  triangle-by-triangle  fashion  using  a  subset  of  the  available  physical  models.  It  has 
been  shown  in  [6]  that  the  matrices  arising  in  this  type  of  device  simulation  are  extremely  ill-con¬ 
ditioned  and  not  easily  solved  using  iterative  techniques.  Therefore,  the  sparse  system  of  linear 
equations  is  solved  using  an  optimized  sparse  direct  solver  as  described  in  [7]. 


c 

PISCES  User  Interface  | 

i 

c 

Nonlinear  solver  | 

|  Linear  Salver 

|  |  Matrix  Formation  &  Assembly  | 

1 - ^ - 1 

Fig.  1: 

PISCES  Program  Structure. 

An  example  PISCES  simulation  grid  is  shown  in  Figure  2.  The  structure  represents  the  2D 
cross-section  of  a  CMOS  inverter.  This  coarse  grid  contains  roughly  1,600  nodes  and  3,000  trian¬ 
gles.  The  structure  was  created  using  standard  one  micron  layout  rules  and  was  derived  from  the 
model  of  an  SRAM  cell  produced  by  our  lab  in  conjunction  with  an  industrial  partner.  We  ran  a 
very  simple  simulation  to  compute  the  DC  terminal  characteristics  using  a  finer  mesh  than  shown 
(roughly  8,000  nodes  and  15,000  triangles)  for  improved  accuracy.  After  reading  the  grid  and  set- 
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Fig.  2:  Grid  structure  of  a  CMOS  inverter. 

ting  the  proper  parameters,  the  simulation  brings  the  power  supply  to  5V  and  then  ramps  the  input 
gates  from  OV  to  5V.  The  output  is  shown  in  Figure  3.  As  expected  in  an  inverter  circuit,  the  out¬ 
put  experiences  a  rapid  transition  from  5  V  to  OV  as  the  input  voltage  increases. 
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This  simple  simulation  required  more  than  two  hours  on  an  IBM  RS/6000  Model  560F, 
the  fastest  serial  computer  available  in  our  lab.  Using  a  slightly  finer  mesh  containing  nearly 
11,000  nodes  and  21,000  triangles,  the  simulation  required  more  that  four  hours.  Simulations  to 
perform  more  detailed  analyses  of  this  structure  can  easily  take  several  days  to  complete.  If  we 
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were  to  perform  multiple  simulations  as  part  of  a  manufacturability  study,  several  weeks  would  be 
needed.  A  significant  boost  in  the  performance  of  PISCES  will  be  needed  if  it  is  to  keep  pace  with 
current  and  future  demands. 

3.  A  Recipe  for  Parallelization  of  PDE  Solvers 

Before  blindly  initiating  the  development  of  any  parallel  application,  one  should  estimate 
the  expected  benefits  and  determine  if  the  expected  effort  will  be  rewarded.  We  know  that  parallel 
grid-based  PDE  solvers  have  had  measurable  success  [8-11].  However,  the  current  version  of 
PISCES  consists  of  approximately  40,000  lines  of  FORTRAN-77.  During  its  fifteen-year  life¬ 
span  several  generations  of  researchers  have  modified  the  code.  Due  to  this  continual  develop¬ 
ment,  the  code  has  become  unwieldy. 

Despite  its  inelegant  program  structure,  great  care  has  been  taken  to  validate  the  code  as 
well  as  to  improve  and  calibrate  die  physical  models.  In  fact,  industrial  users  have  invested  con¬ 
siderable  effort  calibrating  PISCES  with  experimental  data  to  further  improve  accuracy.  More¬ 
over,  its  long  lifetime  has  allowed  a  large,  experienced,  and  satisfied  user  base  to  be  established. 
These  users  understand  the  subtleties  of  the  simulator  and  have  developed  sophisticated  solution 
strategies  unique  to  PISCES.  Finally,  both  the  long  lifetime  and  large  user  base  mean  that  many 
bugs  and  inconsistencies  in  the  code  have  been  exposed  and  either  corrected  or  worked  around. 
Developing  a  new  application  would  force  users  and  developers  to  repeat  an  enormous  amount  of 
non-trivial  and  time-consuming  work  beyond  initial  code  development.  These  facts  make  strong 
arguments  for  parallelizing  the  current  code  provided  that  the  effort  is  not  too  great  and  the  paral¬ 
lel  code  maintains  consistency  with  the  original  serial  code.  This  requires  that  changes  to  the 
existing  code  be  minimized  and  localized.  We  can  not  fundamentally  alter  either  the  data  struc¬ 
tures  or  the  algorithms  if  the  value  of  the  code  is  to  be  preserved. 

To  insure  that  the  parallelization  can  be  done  quickly,  smoothly,  and  without  major 
changes  to  the  program,  it  is  crucial  to  select  an  amenable  parallel  programming  model.  The  two 
dominant  parallel  programming  paradigms  are  data-paiallel  [12]  and  SPMD.  Data-parallel  com¬ 
pilers  require  the  parallelism  to  be  made  explicit  in  the  data  structures.  Modifying  the  data  struc¬ 
tures  in  PISCES  to  be  compliant  with  this  model  would  require  significant  recoding,  precluding  a 
data-parallel  implementation  if  our  original  development  goals  are  to  be  met.  On  the  other  hand, 
the  SPMD  model  appears  to  provide  a  flexible  and  intuitive  framework  for  parallelization  of  grid- 
based  PDE  solvers  without  forcing  significant  changes  to  either  the  data  structures  or  the  code. 
Using  well-known  grid  decomposition  techniques  [13-17],  the  simulation  grid  can  be  broken  into 
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disjoint  pieces.  Each  processor  of  the  parallel  machine  can  run  a  copy  of  the  device  simulator  to 
solve  one  section  of  the  partitioned  grid.  Data  dependencies  will  exist  along  the  shared  boundaries 
between  partitions.  If  these  data  dependencies  can  be  satisfied  through  inter-processor  communi¬ 
cation,  a  global  solution  identical  to  the  serial  solution  can  be  computed.  This  approach  allows  the 
bulk  of  the  code  and  data  structures  to  remain  unchanged.  Using  this  technique,  we  have  parallel¬ 
ized  PISCES  for  distributed-memory  computers.  The  Intel  iPSC/860  hypercube  in  our  lab  was 
used  for  the  initial  development.  Once  the  parallel  application  was  completed,  we  moved  the  code 
to  the  Thinking  Machines  CM-5  to  verify  portability. 

All  of  the  work  described  below  required  roughly  eight  months  to  complete.  We  feel  that 
is  a  reasonable  amount  of  time  to  preserve  a  code  with  a  fifteen-year  history  while  giving  it  a 
vastly  improved  solution  capability.  A  step-by-step  review  of  our  scheme  for  porting  grid-based 
PDE  codes  using  the  SPMD  model  and  our  experiences  adapting  PISCES  using  the  model  is 
given  in  the  following  sections. 

3.1  Deciding  what  to  parallelize. 

It  was  observed  by  Lucas[18]  that  for  small  simulation  grids  (less  than  2000  grid  points) 
PISCES  normally  spends  more  than  95%  of  its  runtime  solving  the  coupled  nonlinear  device 
equations.  The  remaining  fraction  of  time  is  spent  in  the  UI.  Within  the  nonlinear  solver  roughly 
half  the  time  is  spent  computing  linear  solutions  to  obtain  updates  to  the  nonlinear  system.  The 
other  half  of  the  nonlinear  solution  time  is  spent  creating  initial  guesses,  forming  and  assembling 
the  Jacobian  matrices,  evaluating  physical  models,  and  applying  nonlinear  updates.  More  recent 
analysis  shows  nonlinear  solution  times  grow  to  be  more  than  99%  of  the  total  runtime  for  larger 
grid  sizes.  As  grid  sizes  increase,  the  linear  solver  scales  more  poorly  than  other  pans  of  the  code; 
nevertheless,  for  structures  containing  tens  of  thousands  of  grid  points,  15-20%  of  the  runtime 
remains  outside  of  the  linear  solver.  Amdahl’s  Law  [19]  tells  us  that  merely  adding  a  parallel 
solver  to  PISCES  would  yield  at  best  a  5x  speedup  for  many  realistic  grid  sizes. 

Restructuring  the  nonlinear  solver  and  all  of  its  requisite  routines  to  run  in  parallel  is  vital 
to  extending  the  utility  of  the  simulator.  However,  the  UI  routines  take  negligible  time  and  are 
inherently  serial.  In  order  to  accommodate  this  dichotomy,  we  propose  splitting  the  application 
into  a  front-end  program  to  handle  the  serial  UI  tasks  and  a  separate  parallel  program  containing 
the  actual  device  solver. 
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3.2  Separating  the  serial  code  from  the  parallel  code. 

Before  embarking  upon  the  parallelization  of  the  device  solver,  we  propose  to  bifurcate 
PISCES  into  two  serial  programs:  the  serial  UI  program  and  a  serial  solver  application  (SA).  This 
initial  program  decomposition  resolves  the  data  dependencies  between  the  UI  and  serial  SA,  sets 
up  a  communication  mechanism  between  the  two,  and  creates  a  coarse- grain  choreographer  to 
coordinate  their  interaction.  The  resulting  global  structure  of  the  simulator  is  shown  in  Figure  4. 
Since  managing  the  hypercube  already  taxes  the  control  processor  of  the  iPSC/860,  we  opted  to 
run  the  UI  program  on  a  remote  SUN  workstation.  The  serial  S A  was  put  on  one  node  of  the 
hypercube. 


Fig.  4:  PISCES  decomposition:  UI/SA  program  structure. 


After  splitting  the  code  between  the  two  programs,  we  identified  all  of  the  data  structures 
required  by  the  serial  SA.  Like  most  large  FORTRAN-77  programs,  PISCES  data  structures  are 
defined  and  passed  via  COMMON  blocks.  It  was  a  relatively  simple  task  to  identify  all  of  the 
COMMON  blocks  needed  by  the  serial  SA.  Preceding  computation,  the  UI  performs  a  wholesale 
transfer  of  necessary  COMMON  blocks  to  the  serial  SA  using  the  communication  routines 
described  below. 

To  facilitate  the  data  transfer,  we  created  a  communicational  package  using  TCP/IP  [20]  to 
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connect  the  two  programs.  TCP/IP  provides  a  very  slow  connection.  (The  data  transfer  rate  is 
roughly  65kB/s  in  our  network  environment)  However,  the  communication  package  can  easily  be 
replaced  when  faster  communication  resources  are  available.  The  main  complication  involved 
accommodating  the  differing  machine-word  byte  orderings  of  the  Intel  hypercube  and  the  SUN 
workstation.  In  our  communication  routines,  the  receiver  is  responsible  for  byte  swapping  data 
upon  receipt  To  simplify  this  procedure,  we  separated  the  data  into  the  five  types  used  by 
PISCES:  integer,  real,  double  precision,  character,  and  logical.  Similarly  typed  data  are  grouped 
into  large  blocks  for  transmission  between  the  UI  and  SA.  This  makes  the  swapping  operation 
simpler  and  more  efficient.  Once  properly  received  and  reordered,  data  values  are  loaded  into 
their  corresponding  COMMON  blocks.  In  this  fashion  the  data  transfer  is  isolated  from  and  trans¬ 
parent  to  the  original  code. 

A  coarse-grain  choreographer  to  coordinate  the  transfer  of  data  and  solution  commands 
was  also  necessary.  The  coarse-grain  commands  consist  of  a  series  of  bias  conditions  (each  bias 
condition  requires  a  separate  nonlinear  solution)  to  be  solved  by  the  SA.  The  UI  choreographer 
interfaces  with  the  main  dispatch  loop  of  PISCES  (the  routine  which  parses  user’s  commands  and 
calls  the  appropriate  subroutines)  and  intercepts  commands  that  require  remote  processing.  The 
UI  choreographer  then  transfers  the  data  necessary  for  the  computation  requested  followed  by  the 
actual  command.  It  then  awaits  notification  from  the  SA  choreographer  that  the  command  has 
been  executed  and  receives  the  results.  At  the  other  end,  the  SA  choreographer  waits  to  receive 
messages,  initiates  the  requested  action,  and  returns  the  results.  The  changes  to  the  original  code 
were  minimal  since  the  choreographer  is  entirely  external  to  both  the  UI  and  the  SA.  Only  the  dis¬ 
patch  loops  were  modified  to  interface  with  the  choreographer. 

3.3  Adding  Domain  Decomposition 

We  expose  the  parallelism  in  the  problem  through  grid  decomposition.  Naturally,  the  first 
task  for  parallelization  was  to  add  a  domain  decomposition  module  (DDM)  to  the  UI.  At  runtime, 
the  user’s  grid  structure  will  be  given  to  the  DDM  which  will  then  spread  the  grid  among  the  pro¬ 
cessors.  An  example  of  grid-based  decomposition  is  shown  in  Figure  5.  The  CMOS  structure 
described  earlier  has  been  divided  into  sixteen  partitions.  The  partitioning  was  created  using  the 
recursive  spectral  bisection  (RSB)  algorithm  of  Pothen  et.  al.  [14].  The  goal  of  the  DDM  is  to 
divide  the  work  equally  among  the  processors  while  minimizing  the  communication  by  keeping 
boundaries  small.  We  chose  to  use  the  RSB  algorithm  as  the  heart  of  our  DDM.  The  DDM  con¬ 
sists  of  the  RSB  code,  a  pre-processor  to  convert  a  PISCES  grid  description  into  an  RSB  grid 
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Fig.  5:  16-processor  decomposition  using  Recursive  Spectral  Bisection. 


description,  and  a  post-processor  to  correct  any  inconsistencies  in  the  partitioning.  The  changes  to 
the  UI  code  were  minor,  a  call  to  the  DDM  after  the  grid  structure  is  loaded  produces  the  parti¬ 
tioning. 

RSB  was  designed  to  handle  homogenous  grids  (i.c.,  all  nodes  in  the  grid  have  equal  char¬ 
acteristics).  The  addition  of  boundary  conditions  in  the  form  of  external  electrodes  (Figure  6)  can 


Fig.  6:  Electrode  attached  to  5  grid  nodes. 


create  inconsistencies  which  must  be  corrected  by  post-processing  the  RSB  partitions.  The  elec¬ 
trodes  are  included  in  the  grid  description  passed  to  the  DDM;  however,  they  are  different  from 
other  nodes  in  the  grid  since  they  affect  the  assembly  of  all  nodes  connected  to  them.  If  the  set  of 
nodes  attached  to  an  electrode  is  spread  across  multiple  processors  by  the  domain  decomposer 
(Figure  7),  irregularities  will  occur  if  the  electrode  is  not  designated  as  shared  among  the  proces¬ 
sors.  This  can  easily  occur  since  the  RSB  cannot  guarantee  the  an  electrode  will  be  divided  among 
all  partitions  containing  attached  nodes.  The  post-processor  examines  the  partitioning  and  insures 
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that  all  processors  involved  with  nodes  attached  to  an  electrode  are  also  involved  with  that  elec¬ 
trode.  The  result  of  applying  this  operation  causes  electrodes  to  be  shared  across  all  involved  pro¬ 
cessors.  (Figure  8)  and  the  boundary  information  to  be  distributed  properly. 
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3.4  Parallel  Data  Transfer  and  Control 

Once  the  UI  has  the  ability  to  partition  grids  for  parallel  solution,  it  must  be  modified  to 
send  data  to  all  processors  in  the  parallel  partition.  Since  it  would  be  difficult  to  manage  a  separate 
TCP/IP  connection  to  each  processor  (some  machines  have  several  hundred  or  more),  the  UI  con¬ 
tinues  to  connect  to  only  one  processor,  nodej.  Data  to  other  nodes  is  sent  through  node_0  and 
forwarded  to  the  destination  via  the  parallel  message  passing  library.  The  Node  Executive  (NX), 
Intel’s  message  passing  library[21],  is  much  faster  than  the  TCP/IP  connection  making  the  extra 
hop  through  nodej)  relatively  inexpensive. 

Once  the  data  is  distributed,  we  can  begin  to  parallelize  the  computations.  We  intend  to 
have  each  processor  run  a  copy  of  the  SA  on  a  subgrid,  communicating  with  other  processors  to 
resolve  data  dependencies.  To  manage  this  communication  and  synchronization  we  designed  a 
fine-grain  choreographer  to  coordinate  the  processors  much  in  the  same  fashion  as  our  coarse- 
grain  choreographer  coordinates  actions  between  the  UI  and  SA.  Node_0  oversees  the  parallel 
action  by  sending  cues  the  other  nodes  to  initiate  communication,  synchronization,  and  computa- 
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tion.  The  structure  of  the  parallel  simulator  is  shown  in  Figure  9. 


JS3SUB 


Physical  Model  Evaluation 


PISCES  Vta  Internee 


Coane-grain  Choreographer 


UI -Multiprocessor 
Communication 


Coarse-grein  Choreographer 


Rue- grain  Choreograph: 


Nonlinear  solver 


Maura  Formation  A.  Assembly 


DMF  Linear  Solv 


Physical  Model  Evaluation 


Multiprocessor 

Communicstation 

Routines 


Nonlinear  toh  » 


Matrix  Formation  &  Assembly 


DMF  Linear  Solver 


Fig.  9:  Global  structure  of  the  parallel  simulator.  The  fine-grain 
choreographer  controls  execution  and  performs  all  com¬ 
munication  and  synchronization. 


3.5  Nonlinear  Solution 

The  nonlinear  solver  creates  an  initial  guess  at  the  solution  and  then  repeatedly  asks  for  a 
new  Jacobian  matrix  and  residual  vector  to  be  formed,  requests  the  linear  solution  of  this  system, 
and  applies  the  resulting  update  until  the  convergence  criteria  are  met  In  the  nonlinear  solver 
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module  itself,  only  the  initial  guess  routines  require  parallel  communication.  The  matrix  forma¬ 
tion  routines  and  the  linear  solver  reside  in  separate  modules  and  will  be  discussed  later. 

The  initial  guess  routines  need  non-local  information  when  the  creating  a  new  guess  at  the 
solution  based  upon  bias  conditions.  Once  again,  the  boundary  conditions  cause  problems.  For 
example,  to  compute  the  extrapolation  factor  for  a  projection  properly,  all  bias  conditions  must  be 
examined.  The  projection  is  scaled  by  the  differences  between  the  new  bias  conditions  and  the 
previous  values.  To  obtain  this  information,  it  is  necessary  to  have  a  global  picture  of  the  bound¬ 
ary  conditions.  We  evaluated  two  possible  solutions: 

•  Add  a  communication  phase  to  the  algorithm  to  make  all  processors  agree  upon  the  cor¬ 
rect  extrapolation  factors.  This  approach  has  the  disadvantages  of  adding  communication 
overhead  and  requiring  a  moderate  amount  of  code  modification. 

•  Add  extra  data  structures,  duplicated  on  each  processor,  which  contain  global  boundary 
condition  information.  This  has  the  disadvantage  of  requiring  additional  data  structures  to 
store  the  duplicated  global  information.  It  has  the  advantages  of  requiring  no  communica¬ 
tion  and  minimizing  the  code  changes.  (We  simply  modify  the  call  to  the  projection  algo¬ 
rithm  so  that  the  global,  rather  than  the  local,  boundary  conditions  are  passed.) 

We  chose  the  latter  solution  for  three  reasons:  existing  data  structures  were  unaffected,  the 
code  modifications  were  minimal,  and  the  amount  of  extra  data  was  small  and  did  not  impact 
memory  requirements  measurably. 

3.6  Matrix  Form  and  Assemble 

The  assembly  process  proceeds  in  a  finite  element  fashion,  evaluating  on  a  triangle-by-tri¬ 
angle  basis.  Each  triangle  forms  its  local  contribution  which  is  then  added  to  the  global  system.  If 
all  information  needed  during  assembly  is  local  to  a  triangle,  assembly  may  take  place  indepen¬ 
dently  on  all  processors.  This  is  usually  the  case;  however,  some  of  the  physical  models  do  require 
non-local  data.  Unlike  in  the  initial  guess  routines,  there  is  no  simple  way  to  avoid  communicat¬ 
ing  to  resolve  non-local  references  as  the  data  needed  may  be  constantly  updated  by  its  owner. 

Semiconductor  devices  often  contain  different  materials  within  a  single  structure.  One 
example  is  a  silicon-oxide  interface  (Figure  10)  common  in  MOS  structures  [9].  The  mobility  of 
the  carriers  along  these  interfaces  differs  from  the  mobility  in  bulk  silicon.  If  the  user  desires  to 
model  this  mobility  variation,  each  silicon  triangle  along  the  interface  must  have  knowledge  of 
the  material  type  and  present  electrical  parameters  of  its  neighbors  in  order  to  properly  compute 
mobility  at  the  interface.  We  were  forced  to  modify  this  model  so  that  interface  triangles  could 
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obtain  the  material  parameters  of  neighboring  triangles  on  remote  processors.  Before  evaluating 
this  mobility  model,  the  fine-grain  choreographer  initiates  a  communication  phase  to  exchange 
the  required  for  model  evaluation.  The  model  itself  was  modified  to  access  a  temporary 
buffer  containing  the  non-local  information.  By  performing  all  communication  beforehand,  code 
changes  were  again  minimized.  Fortunately,  models  of  this  type  arc  rare  in  the  current  release  ver¬ 
sion  of  PISCES. 


Si/Oxide  Interface 


Fig.  10:  A  simple  n-channel  MOS  structure. 


3.7  Linear  Solution 

The  linear  solution  code  is  distinct  from  the  rest  of  the  simulator  and  has  a  well-defined 
interface.  The  linear  solver  accepts  a  matrix  and  a  right-hand-side  vector  from  the  nonlinear 
solver  and  returns  a  solution  vector.  As  long  as  the  solution  is  correct  and  computed  efficiently, 
the  linear  solver  can  be  treated  as  a  “black  box”  by  the  rest  of  the  application.  The  serial  version  of 
PISCES  uses  a  public  domain  sparse  direct  solver[7].  Although  written  more  than  fifteen  years 
ago,  it  continues  to  be  quite  efficient  on  most  serial  architectures.  However,  adapting  this  code  for 
parallel  execution  is  problematic  as  the  algorithms  and  data  structures  do  not  map  well  to  modem 
distributed-memory  architectures.  Fortunately,  the  segmentation  between  the  linear  solver  and  the 
remainder  of  the  code  allows  us  to  swap  one  “black  box”  for  another  one  with  identical  behavior. 
The  replacement  of  the  linear  solver  had  no  effect  on  the  remainder  of  the  code. 

We  replaced  the  existing  solver  with  a  distributed  multi-frontal  (DMF)  solver  [23,  24] 
developed  in  our  lab  and  modelled  after  the  work  of  [  1 8].  Sparse  direct  methods  generally  contain 
two  phases:  a  symbolic  factorization  followed  by  a  series  of  numeric  factorizations.The  symbolic 
phase  first  reorders  the  equations  to  minimize  non-zero  fill-in  during  elimination.  This  also  mini¬ 
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mizes  the  floating  point  operations  to  compute  the  factor  and  the  memory  required  to  store  the 
factor.  Once  a  reordering  has  been  computed,  the  matrix  is  symbolically  factored  to  the  determine 
the  exact  non-zero  structure  of  the  factor.  Using  this  information,  the  requisite  data  structures  are 
created.  The  numeric  factorization  performs  the  actual  floating  point  work  using  the  elimination 


ordering  and  storage  prepared  by  the  symbolic  phase. 

In  parallel,  the  symbolic  and  numeric  factorizations  become  more  complicated  due  to  both 
the  constraints  on  elimination  order  posed  by  the  separators  and  the  communication  required  to 
factor  them.  RSB  creates  partitions  in  a  top-down  fashion  (Figure  11).  We  exploit  this  behavior 
when  computing  the  order  of  elimination  by  factoring  the  local  domains  independently  followed 
by  the  sets  of  RSB  separators.  The  sets  of  separators  are  factored  in  the  inverse  order  in  which 
they  were  created  by  RSB  (Figure  12). 


Node  0,1 

Node  2.  3 

local  domains 

local  domains 

Q  1st  level  separator 
2nd  level  separators 


NodeO  local  1 

domain  H 

|  Node  2  local 

1  domain 

Node  1  local 
domain 

Node  3  local 
domain 

Fig.  1 1 :  Creation  of  4  partitions :  The  first  separator  (level  1 )  creates 

two  partitions.  The  second  set  of  separators  (level  2)  doubles 
the  number  of  partitions  by  bisecting  the  previous  partitions. 


To  accommodate  the  constrained  order  of  elimination,  a  two-phase  reordering  is  used. 
First,  the  multiple  minimum  degree  (MMD)  algorithm  [25]  is  used  to  reorder  each  processor’s 
local  domain.  Afterward,  equations  within  the  separators  are  added  to  the  elimination  ordering. 
The  interiors  are  factored  independently;  but,  the  processors  must  communicate  to  factor  the  sep¬ 
arators. 

The  non-zero  fill-in  during  a  sparse  direct  factorization  adds  coupling  between  equations 
residing  on  different  processors.  To  insure  a  correct  linear  solution,  processors  must  have  knowl- 
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edge  of  some  equations  that  are  resident  on  other  machines.  The  UI  adds  the  extra  grid  points  that 
generate  these  equations  to  the  processors’  data.  These  grid  points  are  used  solely  by  the  linear 
solver  and  have  no  effect  upon  the  remainder  of  PISCES. 

3.8  Retrospective:  What  additional  permanent  data  structures  were  necessary 

In  the  previous  sections  we  have  described  the  procedures  we  followed  for  implementing 
our  parallel  PDE  solver.  To  provide  a  complete  picture  of  the  parallelization  process,  it  is  useful  to 
describe  the  major  data  structures  added  during  development.  All  of  these  data  structures  are  used 
by  the  choreographers.  They  arc  external  to  PISCES  and  have  no  effect  upon  the  existing  data 
structures.  They  are  summarized  below: 

•  local_to_global_map[  1  ..#local  grid  points]:  Each  processor  keeps  an  array  which  maps 
the  local  node  numbering  of  its  subgrid  to  the  global  node  numbering.  All  communication 
among  the  processors  uses  the  global  numbering  system. 

•  owner [l..#local  grid  points]:  The  array  used  to  determine  the  processor  in  charge  of  accu¬ 
mulating  updates  to  each  shared  grid  point  The  owner  is  responsible  for  communicating  a 
consistent  set  of  values  for  the  grid  point  to  other  processors  sharing  the  node. 

•  share _count[  1  ..#local  grid  points]:  This  array,  in  conjunction  with  share  list,  provides  a 
complete  picture  of  the  interactions  with  other  processors  for  each  the  shared  grid  point. 
The  data  is  stored  using  a  quotient  list  structure  [24],  For  each  grid  point,  share  count 
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points  to  the  starting  location  in  share  Jist  of  that  node’s  list  of  involved  processors  (Fig¬ 
ure  13).  The  number  of  processors  sharing  node  i  is  given  by  share_count[ i+1]  - 
share_count[ i). 

share_list[  1 .  .share_count[#points+ 1]] :  This  array  (Figure  13)  contains  lists,  one  for  each 
grid  point,  of  the  processors  sharing  a  grid  point  It  is  indexed  by  share  count. 


•  separator _level[l.Mocal  grid  points]:  Tells  which  level  separator  a  shared  grid  point 
belongs  in.  It  is  used  exclusively  by  the  linear  solver  as  described  earlier. 

•  Global  Electrode  Data:  This  is  a  complete  set  of  boundary  condition  data  for  the  global 
system  to  insure  consistency  across  the  processors  as  discussed  above. 

•  elem_owner[  1 . ,#global  triangles]:  This  array  is  used  solely  by  the  UI  program  to  deter¬ 
mine  the  proper  destination  for  each  triangle.  It  is  not  sent  to  the  SA. 

3.8  Porting  to  the  CM-5 

To  prove  that  our  methodology  creates  applications  that  can  be  easily  ported  to  other  par¬ 
allel  computers  supporting  the  SPMD  model,  we  moved  the  code  to  the  CM-5.  Although  the  CM- 
5  was  originally  designed  to  support  the  dataparallel  programming  model[26],  it  also  supports  the 
SPMD  programming  model  through  its  CMMD  message  passing  library[27].  Each  processing 
node  consists  of  a  SPARC  processor  controlling  four  floating-point  vector  units.  Unfortunately, 
the  vector  units  can  only  be  utilized  by  programs  written  using  a  dataparallel  language[28].  As 
declared  earlier,  we  were  unwilling  to  make  wholesale  changes  to  the  code.  This  meant  that  we 
were  confined  to  the  SPARC  processor  and  unable  the  access  the  high-performance  vector  units 
from  our  application.  However,  the  machine  still  offers  reasonable  performance  as  demonstrated 
in  our  results  section. 

The  porting  process  was  surprisingly  straightforward.  Most  of  the  work  involved  changing 
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the  UI/SA  communication  layer.  Since  the  CM-5  has  one  (or  more)  SUN  workstation  equivalents 
as  control  processors,  we  chose  to  run  the  UI  on  a  control  processor  rather  than  a  remote  worksta¬ 
tion.  The  control  processors  provided  acceptable  performance  and  provided  a  20Mb/s  connection 
connected  to  the  attached  parallel  machine,  a  considerable  improvement  over  TCP/IP  communi¬ 
cation.  Also,  the  host  can  communicate  directly  with  all  nodes,  making  the  intermediate  hop 
through  node  O  unnecessary.  The  host  uses  the  same  byte  ordering  as  the  parallel  processors, 
making  byte  reordering  unnecessary. 

Otherwise,  we  translated  NX  calls  into  corresponding  CMMD  calls.  In  most  cases,  a  sim¬ 
ple  wrapper  was  sufficient  and  the  existing  code  was  left  untouched.  In  rare  cases,  such  as  the 
hypercube  specific  cubedim()  function,  we  substituted  functions  more  appropriate  to  the  CM-5. 
The  porting  work  required  less  than  two  weeks  and  minimal  code  changes  were  necessary. 

4.  Results 

Using  the  new  parallel  device  simulator,  we  can  scale  the  computational  resources  to 
match  problem  requirements.  This  should  provide  not  only  the  ability  to  simulate  existing  grids  in 
less  time  but  also  the  ability  to  simulate  large,  complex  grids  which  arc  computationally  infeasible 
on  serial  architectures.  In  order  to  demonstrate  the  utility  of  our  new  parallel  application,  we  ran 
the  simulation  of  the  CMOS  inverter  structure  described  in  Section  2  (a  total  of  29  bias  steps)  on 
several  different  machines.  We  simulated  four  varying  grid  sizes  as  described  in  Table  1 .  They 


Table  1:  Simulation  grid  statistics 


2K  Grid 

8K  Grid 

1  IK  Grid 

18K  grid 

Number  of  grid  points 

1,607 

7,844 

10,728 

18,503 

Number  of  triangles 

2,963 

15,158 

20,873 

36,400 

Number  of  equations 

4,821 

23,532 

32,184 

55,509 

comprise  a  realistic  set  of  grids  to  study  various  characteristics  of  this  complex  device.  The  lower 
resolutions  are  suitable  to  model  the  macroscopic  behavior.  The  finer  resolutions  are  required  to 
capture  small-scale  effects  within  the  device. 

Only  the  CM-5  had  sufficient  memory  to  simulate  the  18k  grid.  The  iPSC/860  contains 
enough  total  memory  to  run  the  18K  grid  on  16  or  32  processors;  however,  a  poor  load  balance 
caused  a  small  subset  of  the  processors  to  run  out  of  memory.  In  the  following  tables,  an  “N.A.” 
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entry  indicates  that  insufficient  memory  existed  to  run  the  simulation. 


4.1  Serial  Timings 

To  provide  a  set  of  serial  benchmarks  for  comparison,  we  ran  the  simulation  on  the  three 
fastest  serial  computers  available  in  our  lab.  The  simulation  times  axe  shown  in  Table  2.  Not  sur- 


Table  2:  Serial  Solution  Times  For  CMOS  Inverter 


2K  Grid 

8KGrid 

1  IK  Grid 

SUN  4/670 

1,214s 

40,032s 

N.A. 

IBM  RS/6000  Model  530H 

417s 

11,263s 

23,822s 

IBM  RS/6000  Model  560F 

279s 

7,239s 

15,687s 

prisingly,  the  IBM  Model  560F,  the  fastest  floating  point  processor,  completed  the  simulations  in 
the  shortest  time.  All  of  the  comparisons  between  the  serial  simulator  and  the  parallel  simulator 
are  based  upon  the  corresponding  time  on  the  Model  560F.  The  other  solution  times  are  included 
for  reference. 

4.2  Intel  iPSC  Timings 

We  ran  the  parallel  code  using  the  SUN  4/670  as  the  front  end  for  our  32-processor  Intel 
iPSC/860.  For  the  small  number  of  bias  points  in  this  simulation,  the  setup  time  using  TCP/IP 
contributes  greatly  to  the  runtime.  For  example,  the  data  transmission  to  set  up  a  32-node  partition 
takes  roughly  fifteen  minutes.  Had  we  run  a  more  thorough  simulation,  the  setup  time  would  have 
been  better  amortized.  The  timings  shown  in  Table  3  include  all  setup  time.  To  provide  a  better 
understanding  of  the  asymptotic  behavior  of  the  parallel  code.  Table  4  contains  timings  with  the 
data  transmission  times  subtracted.  In  Figure  14,  we  have  plotted  the  best  serial  solution  time  for 
each  grid  size  as  well  as  the  best  iPSC  solution  times  both  including  and  discounting  the  setup 
communication  overhead. 


Table  3:  iPSC  Solution  Times  for  CMOS  Inverter 


2K  Grid 

8K  Grid 

1  IK  Grid 

iPSC/860  1PE 

1,515s 

N.A. 

N.A. 

iPSC/860  2PE 

829s 

N.A. 

N.A. 
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Table  3:  iPSC  Solution  Times  for  CMOS  Inverter 


2K  Grid 

8K  Grid 

1  IK  Grid 

iPSC/860  4PE 

635s 

N.A. 

N.A. 

iPSC/860  8PE 

597s 

2,687s 

3,611s 

iPSC/860  16PE 

799s 

1,900s 

3,231s 

iPSC/860  32PE 

1279s 

2,298s 

3,107s 

Table  4:  iPSC  Solution  Times  for  CMOS  Inverter  without  TCP/IP 


2K  Grid 

8K  Grid 

1  IK  Grid 

iPSC/860  1PE 

1,480s 

N.A. 

N.A. 

iPSC/860  2PE 

778s 

N.A. 

N.A. 

iPSC/860  4PE 

522s 

N.A. 

N.A. 

iPSC/860  8PE 

372s 

2,462s 

3,386s 

iPSC/860  16PE 

349s 

1,450s 

2,781s 

iPSC/860  32PE 

375s 

1,394s 

2,203s 

For  the  2K  grid,  the  iPSC  cannot  outperform  the  IBM  560F.  This  first  observation  begs  the 
question:  why  did  we  develop  a  parallel  simulator?  The  answer  is:  we  need  the  ability  to  solve 
existing  large  problems  quickly  and  the  ability  to  solve  larger  problems  which  are  computation¬ 
ally  infeasible  on  current  serial  machines.  For  small  problems,  the  overhead  of  setup,  communica¬ 
tion,  and  synchronization  may  make  some  simulations  run  slower  in  parallel  than  serially.  The  2K 
grid  is  such  a  problem.  However,  the  8K  grid  is  more  encouraging.  We  achieve  4x  speedup  over 
the  IBM  including  overhead  and  greater  than  5x  speedup  with  overhead  discounted.  The  1  IK  grid 
is  more  encouraging  still.  We  obtain  a  5x  speedup  with  overhead  and  a  7x  speedup  without  it. 

Figure  14  shows  that  as  grid  sizes  increase,  the  solution  times  of  the  parallel  code  increase 
much  more  slowly  than  the  serial  solution  times.  The  ability  to  scale  more  efficiently  with  grid 
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size  will  prove  significant  as  the  sizes  and  runtimes  of  simulations  grow. 


Fig.  14:  Graph  of  the  best  solution  times  for  the  IBM  560  ind  the 
iPSC/860 . 


4.3  CM-5  Timings 

We  ran  the  simulations  on  two  CM-5e’s.  (The  CM-5e  uses  the  newer  SUN  Viking  proces¬ 
sor  in  place  of  the  original  SPARC  processors.)  The  CM-5  manages  to  barely  outrace  the  IBM 


Table  5:  CM-5  Solution  Times  for  CMOS  Inverter 


2K  Grid 

8K  Grid 

1  IK  Grid 

18K  Grid 

CM-5e  32PE 

262s 

1,150s 

1,792s 

3,054s 

CM-5e  64PE 

270s 

832s 

1,232s 

2,094s 

560F  on  the  2K  grid.  For  the  8K  grid  and  the  1  IK  grid,  we  observe  speedups  over  the  IBM  of 
nearly  9x  and  13x,  respectively.  Finally,  with  the  18K  grid,  we  demonstrate  the  ability  to  simulate 
problems  computationally  infeasible  on  our  serial  machines.  The  ability  to  scale  computational 
resources  makes  this  simulation  possible.  Plotting  the  data  (Figure  15)  reinforces  our  assertion 
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that  the  parallel  simulator  offers  superior  scalability  as  problem  sizes  increase. 


Fig.  15:  Graph  of  the  best  solution  times  for  the  IBM  560F 
and  the  CM-5e. 


5.  Conclusions 

In  this  paper,  we  have  demonstrated  that  PISCES,  a  commercially- supported  “dusty-deck” 
PDE  solver,  can  be  adapted  for  parallel  execution  with  minimal  changes  to  the  existing  serial 
code.  In  this  fashion,  we  have  retained  the  value  captured  in  the  long-term  development  of  the 
program.  The  proven  portability  of  our  methodology  allows  the  code  to  easily  migrate  to  other 
distributed-memory  architectures.  We  believe  the  methodology  described  in  this  paper  should 
work  well  for  any  PDE  solver  with  a  similar  program  structure. 

Maintaining  the  original  code  and  data  structures  does  not  prevent  the  parallel  implemen¬ 
tation  from  providing  striking  reductions  in  simulation  times  for  realistic  large  grids.  We  have 
already  demonstrated  order  of  magnitude  (or  more)  improvements  in  simulation  times  for  current 
problems  as  well  as  the  ability  to  simulate  grids  too  large  for  our  serial  computers.  In  fact,  the 
behavior  of  the  parallel  code  as  grid  sizes  increase  suggests  a  vast  potential  for  simulating  large 
and  ultra-large  structures.  In  the  semiconductor  industry,  the  competitive  advantage  gained  could 
have  immeasurable  benefits. 
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Abstract— The  approximate  block  elimination  method  (ABE) 
for  constructing  pre-conditioning  matrices  in  the  block  itera¬ 
tion  of  systems  of  linear  equations  is  presented.  Four  families 
of  pre-conditioning  matrices  based  on  this  approach  are  iden¬ 
tified  for  solving  the  2  x  2  block  equations  with  the  Jacobian 
resulted  from  the  3-D  one-carrier  semiconductor  device  anal¬ 
ysis  problem.  The  fully  coupled  Newton’s  method  is  used  as  a 
test  case  on  a  parallel  computer,  the  Intel  iPSC/2  hypercube. 
There  are  10  different  pre-conditioning  transformations  includ¬ 
ing  the  alternate  block  factorization  (ABF)  and  the  modified 
singular  perturbation  (MSP)  schemes.  The  results  from  com¬ 
putational  experiments  indicate  that  eight  of  these  pre-condi¬ 
tioning  matrices  significantly  extend  the  convergence  region  of 
the  block  Gauss-Seidel  iteration  process  of  Newton  steps  and 
five  of  them  speed  up  the  convergence  with  a  factor  of  up  to 
more  than  an  order  of  magnitude. 


I.  Introduction 

IN  semiconductor  device  simulation  a  set  of  elliptic  par¬ 
tial  differential  equations  with  appropriate  boundary 
conditions  should  be  solved  approximately.  After  its  dis¬ 
cretization  by  using  the  finite  difference  method  (FDM) 
or  finite  element  method  (FEM)  a  set  of  nonlinear  alge¬ 
braic  equations  is  obtained.  Gummel's  and  Newton’s 
methods  are  frequently  used  to  solve  this  system  of  non¬ 
linear  algebraic  equations  [1].  Gummel’s  method  forms 
smaller  and  more  tractable  systems  of  equations;  for  very 
low  bias  voltages  it  shows  good  performance  and  is  even 
better  than  Newton’s  method  in  terms  of  the  total  CPU 
expense  [14,.  When  using  quasi-Fermi  variables  it  can 
converge  well  in  a  large  region  far  from  the  solution  [3], 
[4],  Newton’s  method  is,  however,  often  required  to 
overcome  slow  convergence  of  Gummel’s  method  in  the 
strongly  coupled  situation  such  as  high  biases. 

In  Newton’s  method,  a  system  of  linear  equations  with 
the  Jacobian  should  be  solved.  Generally,  the  size  of  this 
system  is  too  big  to  use  a  direct  method  in  the  3-D  sim- 
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ulation.  Even  when  it  is  based  on  some  recent  results  us¬ 
ing  parallel  computational  methods,  the  direct  solution 
suffers  from  both  performance  and  CPU  degradation  [2J. 
So,  it  should  be  solved  by  an  inner  iteration.  Block  Gauss- 
Seidel  (G-S)  iteration  is  an  obvious  approach.  Whole  so¬ 
lution  procedure  of  the  nonlinear  equation  set  is  referred 
to  as  Newton-Gauss-Seidel  method  [8],  Newton  step  for 
linearization  of  the  nonlinear  equation  set,  and  G-S  block 
iteration  for  the  linear  block  equations.  Its  primary  ad¬ 
vantage  comes  from  easy  implementation  and  reduced 
storage.  But  its  poor  efficiency  has  been  reported  [7],  [8]. 
The  results  in  [7]  show  that  the  block  G-S  works  well 
only  at  very  low  biases.  It  suffers  from  a  difficulty  similar 
to  Gummel’s  method  owing  to  the  strong  coupling  be¬ 
tween  the  subsystems  which  occurs  when  higher  bias  is 
applied.  This  means  that  Newton’s  method  using  the  block 
G-S  iteration  without  matrix  transformation  cannot  obtain 
any  advantage  under  conditions  when  Gummel’s  method 
fails.  The  reason  for  this  is  that  the  off-diagonal  blocks 
are  significant  in  comparison  with  the  diagonal  blocks  of 
the  Jacobian  [7]. 

This  suggests  that  the  Newton-Gauss-Seidel  method 
should  use  an  efficient  pre-conditioning  transformation  to 
reduce  the  off-diagonal  entries  as  much  as  possible  for 
improving  the  convergence  of  G-S  block  iteration.  Two 
pre-conditioning  strategies,  the  modified  singular  pertur¬ 
bation  (MSP)  method  [6]  and  the  alternate  block  factor¬ 
ization  (ABF)  technique  [8],  were  proposed.  Algebra¬ 
ically,  the  MSP  is  equivalent  to  a  left  pre-conditioning  of 
the  original  system,  and  the  ABF,  a  right  pre-conditioning 
(postconditioning).  The  ABF  approach  used  is  a  more 
general  pre-conditioning  technique  which  can  accommo¬ 
date  m  x  m  block  equations  while  the  MSP  is  only  suit¬ 
able  for  the  2  x  2  case.  The  success  of  the  MSP  encour¬ 
aged  further  development  and  algebraical  study  of  both 
the  MSP  and  ABF  techniques. 

In  this  paper,  a  more  general  approach  to  construct 
block  pre-conditioning  matrices  has  been  developed — we 
will  refer  to  it  as  the  approximate  block  elimination 
method  (ABE).  Both  ABF  and  MSP  methods  can  be  de¬ 
rived  from  the  ABE  method.  Section  II  discusses  how  to 
construct  the  pre-conditioning  matrices  by  using  the  ABE 
method.  The  justification  for  this  approach  is  discussed  in 
Section  HI.  Section  IV  presents  experimental  results  for 
comparison.  Some  conclusions  are  given  in  Section  V. 
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13.  Construction  of  the  PRE-CoNDtnoNiNG 
Matrices 

We  temporally  leave  die  semiconductor  simulation  in 
Sections  II  and  m  for  more  generality  and  convenience. 
In  this  section,  the  idea  of  the  ABE  is  presented  and  then 
several  examples  with  some  trail  of  our  background  prob¬ 
lem  will  be  given  to  illustrate  our  idea.  The  solution  of  a 
general  system  with  m  x  m  block  equations  is  considered. 
Suppose 
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where  x,  e  R"u  x2  e  ffi,  •  •  •  ,  xm  e  /£. 

As  the  first  step  in  the  ABE,  we  construct  an  approxi¬ 
mate  matrix  K  to  the  original  matrix  B.  One  of  the  re¬ 
quirements  for  this  construction  is  that  the  resulting  ap¬ 
proximate  equation: 
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To  get  any  one  of  three  eliminated  forms  of  K  from  K 
we  can  use  two  different  matrix  operations.  The  first  ma¬ 
trix  operation,  i.e.,  the  left-block  elimination,  means 

ClK  =  *  (2.7) 

where  CL  presents  the  block  elimination  matrix  from  the 
left  of  K.  The  other  operation,  i.e.,  the  right-block  elim¬ 
ination,  means 

KCR  =  K  (2.8) 

where  CR  presents  the  block  elimination  matrix  fromthe 
right  of  K.  Cl  and^Cjf  are  dependent  on  selection  of  ATs, 
i.e. ,  Ka ,  KbJ  and 

For  example,  by  using  a  left-block  elimination  matrix 
CL  the  approximate  equation  (2.2)  is  changed  into  the  fol¬ 
lowing  form: 


Kx  =  b 


(2.2) 


Kx  =  CLb 


(2.9) 


is  easy  to  solve.  The  second  requirement  is  that  the  ap¬ 
proximate  matrix  AT  is  as  close  to  2?  as  possible.  We  denote 
the  approximation  of  matrix  element  Bu  by  ATn,  and  sim¬ 
ilarly  for  the  other  terms.  The  approximate  matrix  K  can 
then  be  written  as  follows: 
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The  second  step  in  ABE  is  to  determine  a  block  elimi¬ 
nation  matrix  C'  which  changes  K  into  one  of  the  three 
forms  (/Ts)  listed  below: 


or 
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where  K  is  one  of  three  block  forms  given  in  (2.4)-(2.6). 
That  is  to  say  that  CL  can  eliminate  the  off-diagonal  blocks 
of  the  coefficient  matrix  K  in  (2.2)  completely  or  partly 
mid  let  matrix  K  become  one  form  of  the  block  diagonal 
Ka,  the  block  lower  triangular  Kb  and  the  block  upper  tri¬ 
angular  Kc.  It  means  that  CL  accelerates  the  block  G-S 
iteration  of  the  approximate  equation  (2.2)  extremely. 
Hence  it  can  be  expected  that  CL,  as  a  pre-conditioning 
matrix,  is  able  to  partly  recouple  the  original  equation 
(2.1),  i.e.,  reduce  the  off-diagonal  blocks  of  matrix  B  to 
a  certain  extent.  We  call  CL  the  left  pre-conditioning  ma¬ 
trix.  Similarly,  CR  is  called  the  right  pre-conditioning  ma¬ 
trix  here  or  the  postconditioning  matrix  in  [8] . 

Determination  of  the  block  elimination  matrices  CL  or 
CR  is  an  important  and  difficult  problem.  The  ABF  pro¬ 
posed  by  Bank  et  al.  [8]  gave  an  approach  for  finding  an 
alternate  blocking  by  using  a  permutation  matrix.  It  can 
handle  the  case  =  diag (By),  ( i,j=  1,  •  •  •  ,  m).  Un¬ 
fortunately,  it  is  hard  to  say  presently  whether  it  can  be 
suitable  for  general  cases.  However,  it  should  be  pointed 
out  that  the  difficulty  for  determining  CL  or  CR  has  been 
reduced  because  K  has  a  more  general  form,  i.e.,  it  con¬ 
tains  not  only  Ka  but  also  Kb  and  Kc. 

In  the  following  we  will  illustrate  the  procedure  through 
several  examples  and  point  out  some  important  consid¬ 
erations  along  the  way. 

Pre-Conditioning  From  the  Left 

First,  consider  the  construction  of  the  left  pre-condi¬ 
tioning  matrix  C  for  the  2  x  2  block  equation.  With  an 
approximate  matrix  K  of  the  form  (2.3),  denote  that  the 


block  elimination  matrix  C  is  of  the  foim: 


Qi  Q2 
_C2|  C22. 


We  now  multiply  both  sides  of  the  approximate  equation 
(2.2)  from  the  left  by  matrix  C: 


CKx  =  Cb  (2.10) 

which  gives 
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Assume  Cu  —  C22  =  /,  and  the  existence  of  AT  7-  If 
we  choose  to  change  K  into  the  form  given  by  Kc  and  let 
C12  =  0,  then,  we  have 
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and  (2.10)  becomes 
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Since  K2}  in  the  approximate  equation  (2.2)  has  van¬ 
ished  by  using  matrix  C,  as  seen  in  (2.12),  the  block 
C-S  iteration  needs  to  solve  two  block  diagonal  equation 
sets  of  (2.12)  sequentially  only  one  time,  i.e.,  an  iteration 
method  becomes  a  direct  method.  In  other  words,  matrix 
C  gives  the  block  G-S  iteration  maximum  acceleration  for 
solution  of  the  approximate  equation  (2.2).  Using  C  as  a 
left  pre-conditioning  matrix,  (2.1)  becomes 
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It  is  easy  to  see  that  (B2i  -  K2XKXXBU)  of  (2.13)  is 
equal  to  0  if  Af,,  =  Bu  and  K2X  =  B21.  But  under  this 
condition  it  is  likely  that  the  solution  of  £,7  is  difficult 
and  the  nonzero  structure  of  (B22  -  K2lKu'Bi2)  given  in 
(2.13)  will  be  changed  when  compared  with  B22,  owing 
to  the  fact  that  the  nonzero  structure  of  a  product  matrix 
is  usually  more  complex  than  that  of  the  factor  matrices. 
Therefore,  although  ideally  the  approximate  matrix  K 
should  be  the  same  as  the  original  matrix  B,  a  practical 
constraint  is  that  it  should  be  easy  to  invert  the  sub-blocks 
K\\  and  Af22. 

To  be  specific,  we  choose  the  approximate  matrix  AT, 
of  B  to  be  as  follows: 
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where  £),,  is  the  diagonal  part  of  Bn,  and  similarly  for 
the  other  terms.  We  should  then  choose  a  pattern  of  the 


block  elimination  form  Ka,  Kb,  and  Kc.  If  we  select  Kc  so 
as  to  recouple  the  off-diagonal  block  K2I  =  B2I,  then  the 
block  elimination  matrix  C\  can  be  obtained  easily  from 

(2.11): 
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The  pre-conditioned  block  equation  is  the  following: 
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The  block  G-S  iteration  can  be  then  employed  to  solve 
this  transformed  block  set  (2.16),  which  is  the  MSP  that 
was  presented  in  NUPAD  II  and  proved  to  be  a  rather 
good  acceleration  of  the  block  G-S  iteration  in  3-D  one- 
carrier  device  simulation  [6).  _ 

Choosing  a  block  elimination  pattern  of  Ka,  we  can  ob¬ 
tain  a  different  block  elimination  matrix  C2: 


C2 


I  —DnDn 
. ~B2\Du  I 


(2.17) 


The  transformed  approximate  equation  becomes 


Du  “  Dl2D2jB2x 

0 

■>Ci 

0 

Du  ~ 

-*2- 

b 1  —  Dx2D-2^b2 
-b2  —  B2lDu'bl . 

after  using  C2  as  a  block  elimination  matrix.  The  effects 
of  the  transformation  matrices  C,  and  C2  are  almost  the 
same,  as  demonstrated  later  by  several  device  simulation 
examples  in  Table  III.  The  reason  for  this  will  be  dis¬ 
cussed  in  Sections  III  and  IV. 

Other  left  pre-conditioning  matrices,  as  listed  in  the 
Appendix,  can  be  similarly  constructed  using  different  K 
and/or  K.  We  will  now  turn  to  discuss  how  to  construct 
the  right  pre-conditioning  matrices. 


Pre-Conditioning  From  the  Right 
As  an  alternative,  the  right  pre-conditioning  matrices 
can  also  be  constructed.  For  example,  again  choose  K\  in 
(2.14)  as  the  approximate  matrix  to  B  and  select  a  block 
elimination  pattern  of  Ka,  the  corresponding  block  elimi¬ 
nation  matrix  C3  is 

=  [  I  -Du'Dn' 

3  L  -D22xB2x  / 

Suppose  that  Cf 1  exists,  we  have 

Kxc3c;'x  =  b 


(2.18) 

(2.19) 


or 

(*,).?  =  b  (2.20) 


where 

(KOa  «  AT.Cj 


_  Ai  ~  AsAi'Ai  0 

0  A2  ~  fi2lA  1*^12  - 


(2.21) 

and 

>’  =  CT'x-  (2.22) 

C3  can  be  used  as  a  pre-conditioning  matrix  of  (2.1)  and 
the  coefficient  matrix  of  the  pre-conditioned  equation  is 

R  -  BC3 

Al  —  Bl2D22'B2]  fil2  ~  A.Al'Aj 
~  ,  (2.23) 

.S21  —  RnD&lhi  B22  ~  B2tDnDl2. 

So  we  must  solve 

Ry  =  b  (2.24) 

for  the  variable  y  first  then  get  x  from 

x  =  C3y.  (2.25) 

We  can  choose  K2  as  another  approximate  form  of  the 
matrix  B 


Dn  A2 

A  j  — 

-Al  A2- 


(2.26) 


Kyx  =  b 


(2.27) 


(2.23) 


approximate  matrices  are  formed  and  block  elimination 
patterns  are  selected. 

We  now  derive  a  pre-conditioning  matrix  as  an  example 
for  the  3  x  3  block  equation  after  detailing  for  the  2  x  2 
case  above.  Suppose  the  block  equation 

fill  fil2  fi|3  xi  ^1 

fix  =  S21  B22  fi23  ^2  =  b2  —  b  (2.30) 

-fi31  fi32  B33_  _X3_  _b3_ 

where  Xj  e  x2  e  R%,  and  x3  e  and  an  approximate 
matrix  of  B  in  (2.30): 


(2.31) 


We  multiply  both  sides  of  the  corresponding  block 
equation  with  coefficient  matrix  K3  from  the  left  by  using 
matrix  C  to  let  K3  become  a  block  upper  triangular  de¬ 
noted  by  Kc  in  (2.6).  That  means 


Ai 

A2 

a3 

A. 

D22 

As 

fi3. 

A2 

As 

CKyX  =  Cb 


and  then 


and  construct  several  pre-conditioning  matrices  from  the 
left  and  right  by  the  idea  similar  to  that  given  above.  Here, 
we  just  mention  one  of  them  as  the  ABF  named  by  Bank 
et  al.  [8]  to  illustrate  how  to  construct  the  ABF  by  our 
idea  which  extends  the  method  of  Bank  et  al.  Suppose  the 
existence  of  K2 1  and  select  K  to  be  the  identity  matrix  /, 
a  special  form  of  Ka.  Hence  take  C4  =  K^x  as  a  recoupling 
matrix  of  the  approximate  equation 


(K3)cx  =  Cb. 

Suppose  Cn  =  C22  —  C33  —  /,  C|2  =  C13  —  C23  =  0 
and  the  existence  of  Df/  and  (D22  -  AiAVA2)_1-  It  is 
easy  to  determine  the  rest  of  the  elements  in  C  and  they 
are 

C2,  =  -D2]Dn' 

C31  =  — ((B3|DnlD12  —  D32) (D22  —  D2xDulDx2)  ‘D2\ 

+  fi3l)A"l' 

(-32  =  (AiAi'Aa  —  A2  )(A:  —  A1A/A2) 

The  matrix  R  pre-conditioned  by  C  from  the  left  is 


fill  fi|2  fi|3 

R  =  CB  =  Bn  —  A.A.'Bn  B22  —  AiAi'Bi:  B23  —  D2XDXXBX3 

_fi3i  +  C31Bn  +  C32B2 1  B32  +  C31B,2  +  C32B22  B33  +  Cj.B,,  +  C32B23 

from  the  right  for  changing  the  variables.  We  have  If  we  take  the  approximate  matrix  of  B 


If  we  take  the  approximate  matrix  of  B  as 


■  r  A2C  — A2  Q 
C4  =  Kj1  —  n  o  n  „  (2.28) 

.  “Ai  Q  D\\Q. 

where  Q  =  (DnD22  -  AjAi)-1-  The  right  pre-condi¬ 
tioned  matrix  of  (2.1)  is  thus 

R  -  BC„ 

(B11A2  “  fil2Al)C  (B12A1  ~  BX]DX1)Q 

-(B21A2  —  B^D^Q  (B^Dn  —  B2iDx2)Q_ 

(2.29) 

Similarly,  other  pre-conditioning  transformations  for 
the  2  x  2  case  can  be  constructed  when  other  suitable 


A. 

D\ 2 

As 

A, 

A: 

A3 

A, 

A: 

As 

(2.32) 


and  choose  a  block  elimination  pattern  of  I.  Then  if  K T1 
exists,  K T1  can  become  the  pre-conditioning  matrix  named 
the  ABF  method  [8).  It  is  easy  to  find  that  our  method 
includes  the  ABF  and  greatly  extends  it.  Thereby  the  ABE 
provides  more  flexibility  for  better  using  the  features  of 
the  original  coefficient  matrix  B  in  the  construction  of  the 
pre-conditioning  transformations.  This  means  that  we 


have  more  choices  in  constructing  the  pre-conditioning 
matrices  in  two-carrier  device  simulation. 

Similarly,  we  can  construct  other  pre-conditioning  ma¬ 
trices  by  using  the  idea  above  if  the  number  of  the  blocks 
in  the  block  equation  is  more  than  3. 

HI.  Analysis  of  the  Idea  for  2x2  Block 

We  will  show  that  the  approach  of  constructing  the  pre¬ 
conditioning  transformation  matrices  mentioned  in  Sec¬ 
tion  II  is  reasonable  under  some  requirements  of  matrix  fi 
in  (2.1).  The  discussion  is  just  limited  in  the  situation  of 
2x2,  i.e.,  m  =  2  in  (2.1). 

Suppose  that  Bu  e  if?  x  R"x,  fi22  e  R%  x  R%,  and  the 
strict  diagonal  dominance  of  both  fi,,  and  5^.  If  fi,,  is 
written  as 


=  £>,,  +  (3.1) 

where  e,,  e  K[  x  R"x,  the  strict  diagonal  dominance  of  fin 
means 

2  |(c„)v|  <  |tf>„),|,  i-1.  •••,».  (3.2) 

j 


The  strict  diagonal  dominance  of  fi22  means  an  inequality 
similar  to  (3.2)  holds 

S  |(e22)y  I  <  I (^22)1 1  >  1  =  1,  •  •  •  ,  n  (3.3) 

j 


where  £22  —  fi 22  —  ^22  ^  fij  x  fij. 

If  using  C2  of  (2.17)  as  a  left  pre-conditioning  matrix 
of  (2.1),  we  have 


I  -dx2d£ 
-fi2iz>r,'  1 
1 

n\^u 


1 

Bn 

’*1" 

U21 

B22 - 

-*2- 

-] 


—BuDr1 


~ 0)2^22 
/ 


bx 

jl*2j 


and  can  then  rewrite  the  above  equation  as 


Rx  = 


®ll  ~  &X 2^22  *h\  Bx2  “  ^12^22lf22 
-B2\Dxxtu 


fi22  -  b2]du%2 


*1 

*2 


b\  ~  Dx2D22b2 

*  h  R  n-'h  (34) 

\_02  -  D2XUXX  0 1  J 

where  fi,2  =  B12  -  Dx2.  A  significant  thing  arises  in  the 
matrix  R  which  is  transformed  from  fi.  Let’s  look  into  R2X 
by  using  the  strict  diagonal  dominance,  as  shown  in  (3.2), 
offi,,: 

Iifi2lll«  =  l|B21^ll16|l  II ob 

S  IIB21 II®  l!^||1«l|  lla.  <  II  fi2 1 II  o»  •  (3.5) 


This  means  that  the  transformed  block  equation  (3.4)  has 
been  recoupled  by  using  the  left  pre-conditioning  matrix 
C2  to  some  extent.  The  smaller  IlDjVtulL  is,  i.e.,  the 
stronger  the  stric'  diagonal  dominance  of  fin  is,  the  better 
the  effect  of  the  left  pre-conditioning  transformation  will 
be.  The  block  G-S  iteration  for  solving  (3.4)  needs  to  be 


done  only  one  time,  in  the  extreme,  if  IlDjVenlL  =  0, 
i.e.,  fin  becomes  a  diagonal  matrix.  From  (3.5)  we  see 
that  the  strict  diagonal  dominance  of  fin  can  accelerate 
convergence  of  the  block  G-S  iteration  by  using  some 
suitable  pre-conditioning  transformations  of  the  ABE 
method. 

Besides,  if  fi|2  is  a  diagonal,  then  fiI2  =  Di2  or  Bx2  = 
0.  Under  these  conditions  we  have 

fi|2  =  ~Di2D22  (22  (3.6) 


R22  ~  ~  D\2  (3.7) 

and  a  relationship  Ilfi^lL  <  IIB12IL,  similar  to  (3.5), 
holds  because  of  the  strict  diagonal  dominance  of  B&.  It 
means  that  the  coupling  between  both  variable  sets  x,  and 
x2  is  further  reduced  and  in  turn  the  further  acceleration 
of  the  block  G-S  iteration  can  be  expected. 

In  3-D  one-carrier  device  simulation,  the  block  ma¬ 
trices  fin,  B^,  and  fi21  have  the  identical  seven-diagonal 
structure  resulted  from  the  finite  difference  scheme  for 
discretizing  the  equations.  It  can  be  seen  that  the  nonzero 
structure  of  R22  is  the  same  as  fi22,  and  it  is  beneficial  to 
solve  the  block  equation  with  the  coefficient  matrix  R-& 
resulting  in  a  saving  of  both  storage  and  CPU  time. 

The  effects  of  the  pre-conditioning  may  not  always  be 
beneficial  when  only  one  of  the  matrices  fin  and  fi22  is 
strictly  diagonal  dominant  or  fi12  *  0.  For  example,  sup¬ 
pose  that  only  fiM  is  strictly  diagonal  dominant  and  fi,2  * 
Dn,  a  left  pre-conditioning  matrix  is 


C5 


I  —DnD22 
0  I 


(3.8) 


Then  we  have  *2.  —  fi2i  and  R\2  —  — ^12^22  ^22*  1^  can 
be  seen  that  fi2)  remains  unchanged  in  the  pre-conditioned 
matrix  R;  and  if  fi22  is  a  row  zero-sum  matrix,  then  Rl2  is 
merely  Dn  redistributed  by  -D22t22.  This  means  that  the 
original  block  equation  (2.1)  does  not  benefit  from  the 
pre-conditioning  matrix  C5.  The  results  for  the  device 
simulation  case  (Tables  III  and  V)  confirm  the  analysis 
here  since  we  have  the  relationship: 

C5  =  C'u=  CL,  (3.9) 


Although  we  have  only  presented  the  heuristic  analysis 
on  the  pre-conditioning  matrices  of  C2  and  C5,  other 
transformations  can  be  analyzed  in  a  similar  manner.  Most 
conclusions  still  hold  whether  the  left  or  right  pre-condi¬ 
tioning  matrices  are  formed  from  an  approximate  matrix 
of  K 1  or  K2.  The  discussion  for  device  simulation  will  be 
given  in  detail  in  Section  IV  after  the  experimental  results 
are  presented. 

The  above  analysis  indicates  that  it  is  very  important  to 
carefully  choose  an  effective  pre-conditioning  matrix  to 
improve  the  performance  in  iteratively  solving  block 
equation  (2.1)  by  exploiting  the  characteristics  of  the 
original  coefficient  matrix  fi.  On  the  other  hand,  it  appears 
that  more  characteristics  in  the  original  Jacobian  can  be 
exploited  in  addition  to  the  strict  diagonal  dominance,  for 
example,  the  ABF  with  variables  n  and  p  has  demon- 


strated  promising  results  [8]  without  the  strict  dominance 
of  the  diagonal  blocks.  This  indicates  the  need  to  extend 
the  above  basic  analysis  of  the  ABE  method. 


IV.  Numerical  Examples  in  Device  Simulation 

Having  developed  the  underlying  matrix  techniques 
presented  in  Sections  II  and  m,  let  us  now  return  to  the 
semiconductor  problem.  For  the  sake  of  simplicity,  it  is 
assumed  that  the  electron  current  continuity  equation  is  to 
be  solved  together  with  Poisson’s  equation.  In  normalized 
form,  the  equations  are  given  by 

V  •  (eV^)  =  rt  -  p  +  Na  —  Nd 


V  •  /„  =  0 


(4.1) 


in  Q,  where  n  =  exp  (i£  -  4>n),  p  =  exp  (4>p  -  i)  and  Jn 
=  —qp„nV<kn.  We  assume  that  Q  is  a  simply  connected 
cube  region.  As  can  be  easily  verified,  the  normalization 
constants  are:  thermal  voltage  (JkT /q)  for  electrostatic  po¬ 
tential  4  and  quasi-Fermi  levels  4>„  and  4>p,  intrinsic  car¬ 
rier  concentration  n,  for  carrier  and  impurity  concentra- 
tions,  and  the  intrinsic  Debye  length  yJekT/(q1ni)  for  the 
spatial  dimension.  Boltzmann  statistics  are  assumed  as  can 
be  seen  in  the  formulas  for  n  and  p.  Tabulated  doping 
dependent  mobility  values  are  used.  At  present,  bandgap 
narrowing  effects  at  high  doping  concentrations  and  field 
dependent  mobility  are  neglected. 

The  simulation  domain  Q  is  approximated  by  a  3-D  rec¬ 
tangular  grid.  Equations  (4. 1)  and  (4.2)  are  discretized  by 
the  finite  difference  method.  The  numerical  experiments 
were  implemented  in  a  hypercube  multiprocessor,  the  In¬ 
tel  iPSC2™.  Our  recent  experimental  problem  has  been 
presented  in  detail  in  [6]. 

We  employ  Slotboom  variables  4>„  =  exp  (—</>„)  and 
=  exp  (<?>,,)  as  the  independent  variables  because  of  the 
known  advantages  [11],  [12],  [7],  To  increase  the  bias 
range  two  scaling  schemes  which  preserve  the  symmetry 
of  main  diagonal  blocks  in  the  Jacobian  are  used  (see  [6] 
for  detail).  The  block  equation  with  the  Jacobian  in  each 
Newton  step  can  be  written  as  follows: 


fix  = 

0*4, 

~A*  " 

'-F/ 

A*n*.  - 

_a*b_ 

One  primary  advantage  is  that  the  matrix  A++  in  block 
equation  (4.3)  is  strictly  diagonal-dominant,  symmetric, 
and  positive  definite,  and  A+^„  is  symmetric.  The  struc¬ 
ture  of  the  three  matrices  Aw,  A4„*„  and  A+^  in  the  Ja¬ 
cobian  of  equation  (4.3)  consist  of  seven  diagonals  with 
the  exception  of  D^n  being  a  diagonal.  Two  approxima¬ 
tions  to  the  Jacobian  B 


and 


0** 

0* 4. 

d*a. 

0** 

0*4. 

04,4. 

(4.4) 


(4.5) 


are  used  to  construct  the  pre-conditioning  matrices.  We 
denote  C[a  as  a  left  pre-conditioning  matrix  which  trans¬ 
forms  Kt  into  the  form  of  (AT,)a  in  (2.4).  For  example, 
with 


C'Kc  = 


/  0 
I. 


we  have 

and 

BC'Rc 


0**  -  D<lA.DZ,<>„A+'4 
0 


L'W  ~  £>*„*„  A*.*  A*.*, 


(4.6) 


(4.7) 


(4.8) 


So  we  can  obtain  6  different  transformation  matrices  such 
as  Cia,  C'u,,  C'u,  C*,.  and  C'Rc  from  the  approximate 
matrix  ATt,  and  similarly  another  six  from  the  approximate 
matrix  K2-  It  should  be  noted  that  Cjj,  =  Ch,  and  = 
C\t,.  Therefore,  there  are  total  of  ten  different  pre-condi¬ 
tioning  matrices  and  the  corresponding  pre-conditioned 
matrices  (see  the  Appendix). 

We  have  tested  the  block  G-S  iteration  convergence  of 
these  pre-conditioned  matrices  evaluated  at  the  first  non¬ 
linear  Newton  iteration  of  the  device  simulation.  These 
tests  were  made  at  several  combinations  of  bias  voltages 
for  Vf,  and  K*.  In  the  block  G-S  iteration,  we  employed 
the  ICCG  algorithm  [9]  for  the  solution  of  the  symmetric 
matrices  and,  the  ILU-CGS  [10],  [6]  and  the  ILU- 
GMRES  [11],  [6]  for  the  asymmetric  matrices.  We  sum¬ 
marize  the  numerical  results  using  the  following  four 
comparisons. 

1)  The  Convergence  Comparison  of  ABF,  MSP,  and 
Plain  G-S:  The  values  in  Table  1  indicate  iteration  num¬ 
ber  k  when  both  following  convergence  criteria  are  satis¬ 
fied: 


A^k  -  Atf-* 


Aif/* 

-  A4>* 


A*! 


<  10“3 
<  10'3 


(4.9) 

(4.10) 


where  Ai/'*  and  A4>*  are  the  solutions  exact  enough,  i.e., 
those  with  error  less  than  10  "6  in  the  convergent  block 
iteration. 

Table  I  shows  the  convergence  of  three  block  G-S  it¬ 
erations.  For  simplicity  of  the  table,  the  ABF  method  [8] 
is  expressed  as  A,  the  MSP  method  [6]  as  M  and,  the  plain 
G-S,  which  does  not  accept  any  pre-conditioning,  as  G. 
Row  voltages  are  stepsize  is  0.5  V,  and  column  volt¬ 
ages  are  Vgi.  Obviously,  the  convergent  region  of  the  ABF 
and  the  MSP  is  much  larger  than  that  of  plain  G-S.  As 
shown  in  the  Table  I,  all  iteration  numbers  of  the  ABF 
and  MSP  are  always  limited  within  ten,  while  those  of  the 
plain  G-S  are  over  tens  even  more  one  hundred  except  V* 
=  0.5  V.  So,  the  convergence  rate  improvement  of  the 


TABLE  I 

The  Convergence  Comparison  of  ABF,  MSP  and  Plain  G-S  at  the 
First  Nonunear  Newton  Iteration  (A:  ABF,  M:  MSP  and  G:  Plain 

G-S) 


0.5 

1.0 

1.5 

2.0 

2.5 


2  2  2 

2  2  2 

3  3  2 

5  5  5 

5  5  5 

22  142  x 

5  7  6 

4  5  5 

41  77  >  150 

5  6  6 

5  5  6 

53  >130  >150 

4  5  6 

4  6  6 

63  97  117 


2  2  A 

2  2  M 

3  3  G 

6  6  A 

6  6  M 

x  x  G 

6  6  A 

6  6  M 

x  x  G 

6  6  A 

6  6  M 

x  x  G 

7  7  A 

6  6  M 

>150  >150  G 


ABF  and  MSP  can  be  as  large  as  more  than  an  order  of 
magnitude  in  comparison  with  the  plain  G-S.  Another  im¬ 
portant  fact  is  that  the  iteration  numbers  of  the  ABF  and 
MSP  increase  slowly  tfhen  bias  voltages  grow. 

Table  II  summarizes  part  of  the  results  from  Table  I  to 
give  a  deeper  impression  about  the  convergence  of  the 
plain  G-S  iteration.  We  use  three  simple  symbols  V,  *, 
and  x  to  illustrate  the  different  convergence  of  it  in  im¬ 
ages.  The  symbol  V  indicates  that  the  iteration  number  is 
competitive  with  the  ABF  and  MSP,  *  represents  the  it¬ 
eration  whose  number  exceeds  20,  and  x  divergent. 

Table  II  shows  that  plain  G-S  iteration  is  very  poor  in 
the  Newton’s  steps  used  for  device  simulation.  It  can  con¬ 
verge  rapidly  only  for  a  very  small  Kgs  region,  and  its 
convergence  performance  deteriorates  quickly  as  Pgs 
grows.  In  contrast,  the  results  in  Tables  I  and  II  show  that 
the  ABF  and  MSP  can  work  well  whereas  the  plain  G-S 
method  shows  its  poor  convergence. 

2)  The  Comparisons  of  Four  Pre-Conditioning 
Families:  We  give  the  convergence  comparisons  of  four 
pre-conditioning  families  under  the  same  conditions  stated 
at  (4.9)  and  (4.10). 

It  is  easy  to  see  from  Tables  UI-VI  that  all  of  the  pre¬ 
conditioning  transformations  except  C]^  and  C2U  improve 
the  convergence  of  block  G-S  iteration  more  or  less.  Es¬ 
pecially,  the  transformations  Cl,,  C'u  (MSP),  C*,,  C«, 
(=  C«,)  and  C* a  (ABF)  demonstrate  very  high  conver¬ 
gence  speed.  If  noting  that  AH  =  of  (3.1)  and  so  on, 
it  is  clear  that  the  relationships  ||D^UWIL  <  1  and 
<  1  are  held  because  of  the  strict  diagonal 
dominance  of  A According  to  (3.5),  we  observe  that 
one  of  the  two  off-diagonal  blocks  in  those  five  corre¬ 
sponding  pre-conditioned  matrices  (see  the  Appendix)  was 
reduced  by  D^A^  or  A++D^l  and  it  improves  conver¬ 
gence  of  the  original  matrix  B. 

The  effect  of  C'u,  (=  Cjj,),  C*.,  and  C\c  is  not  so  good 
as  the  five  kinds  mentioned  earlier.  The  reason  for  this  is 


TABLE  II 

The  Convergence  of  Plain  G-  Iteration  (V :  Competitive  with  ABF 
and  MSP,  *:  Iteration  No.  over  20,  x :  Divergent) 
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1.5 

* 

* 
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X 

X 

2.0 

* 

• 

* 

X 

X 
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• 

* 

• 

* 

• 

TABLE  III 

The  Convergence  Comparisons  of  Pre-Conditioning  Matrices  CL, 
CL,  and  C'u.  This  Left  Pre-Conditioning  Family  is  Formed  by 
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31 
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29 
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4 

5 

6 
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TABLE  IV 

The  Convergence  Comparisons  of  Pre-Conditioning  Matrices  CL. 
CL.  and  C)t.  This  Right  Pre-Conditioning  Family  is  Formed  by  K,  ' 
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TABLE  V 

The  Convergence  Comparisons  of  Pre-Conditioning  Matrices  C\,, 
CL-  and  CL-  This  Left  Pre-Conditioning  Family  is  Formed  by  K2 
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that  neither  of  off-diagonal  blocks  in  corresponding  pre¬ 
conditioned  matrices  was  reduced  significantly  because 
A+a.  is  just  a  row  zero-sum  matrix  and  not  a  strictly  di¬ 
agonal  dominant  one.  The  limitation  of  C'u,  (  =  Ch,),  i.e.. 


Cj  in  (3.8),  was  discussed  in  Section  m.  From  the  Ap¬ 
pendix  it  is  to  be  found  that  the  situation  for  and 
C|c  is  almost  the  same  as  C'u,  (=  C|>). 

We  observe  that  c£,  and  C2U  show  worse  convergence 
than  the  plain  G-S  iteration  does  even.  It  should  seem  that 
the  off-diagonal  block  (A*^,  -  D^D^A^)  in  the  ma¬ 
trices  Ru,  and  R*u  (shown  in  the  Appendix)  pre-condi¬ 
tioned  by  CL  and  C2U  plays  a  bad  role  in  the  block  iter¬ 
ation.  Although  the  diagonal  elements  of  (AM  - 
D^D^A^,)  were  canceled  maybe  some  off-diagonal  en¬ 
tries  of  it  were  enlarged  greatly. 

It  should  be  pointed  out  that  the  results  for  C&,  shown 
in  Table  VI  came  from  using  the  pre-conditioning  trans¬ 
formation  Co,  in  the  Appendix  exactly,  not  C4  in  (2.28). 
We  can  say  that  the  ABF  [8],  whose  pre-conditioning  ma¬ 
trix  is  given  by  C4,  proposed  by  Bank  et  al.  is  normal 
because  the  diagonal  elements  of  the  pre-conditioned  coef¬ 
ficient  matrix  become  1.  The  computational  results  for 
several  block  equations  with  Jacobian  indicate  that  the  dif¬ 
ference  between  the  normal  ABF  and  the  transformation 
C&,  can  be  ignored  within  the  error.  But  the  total  expense 
of  the  normal  ABF  in  per  block  iteration  is  much  more 
than  the  C&,  because  there  are  two  asymmetrical  main  di¬ 
agonal  blocks  and  not  one  in  the  pre-conditioned  matrix 
of  the  normal  ABF.  The  CPU  time  for  solution  of  an 
asymmetrical  equation  often  is  four  times  more  than  the 
symmetrical  one. 

3)  The  Direction  Convergence  Comparison:  The 
damping  technique  in  Newton's  steps  is  used  to  guide  the 
Newton’s  iteration  to  convergence.  This  means  that  in 
solving  the  linear  equations  (4.3)  occurred  in  a  damping 
Newton  step,  it  is  more  important  to  get  the  accurate  di¬ 
rection  of  the  update  vector  than  its  accurate  magnitude. 
We  show  the  direction  convergence  for  comparing  the 
ABF  with  the  MSP  in  two  examples.  The  direction  con¬ 
vergence  means 

(DRT)$  =  cos  (Mk,  At*)  (4.11) 

(DRT)*„  =  cos  (A#,  A*„*)  (4.12) 

where  (A^\  A\p*)  is  the  angle  between  Atk  and  At*, 
which  has  the  same  meaning  as  (4.9),  and  similarly  for 
(4.12). 

We  can  see  that  (DRT)$  of  the  MSP  often  is  larger  than 
(DRT)J,  of  the  ABF  from  Tables  VII  and  VIII.  In  the  fact, 
the  feature  above  came  true  at  most  combinations  of  V%i 
and  VAi.  We  found  that  the  MSP  often  took  less  CPU  time 
than  the  ABF  at  damping  Newton  method  because  of  the 
special  importance  of  At  at  each  Newton’s  step. 

4)  The  Effects  of  K\  and  K2:  Tables  III— VII  indicate 
that  the  closer  to  the  original  coefficient  matrix  B  the  ap¬ 
proximate  matrix  K  is  the  better  the  pre-conditioning  ef¬ 
fect  is.  As  the  final  comparison,  we  summarize  the  results 
of  Tables  BI-VI  by  using  three  symbols  V,  *  and  x  to 
illustrate  the  different  effects  of  both  approximate  ma¬ 
trices  A]  and  K2  in  Table  IX.  Symbol  v  represents  the 
particular  pre-conditioning  matrix  C  which  greatly  im¬ 
proves  the  convergence  of  the  block  G-S  iterations,  while 


TABLE  VI  V 

Convergence  Comparisons  of  Pre-Conditioning  Matrices  CL-  Cl. 
and  Cj*.  This  Right  Pre-Conditioning  Family  is  Formed  by  A2 
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TABLE  VII 

Direction  Convergence  Comparison  of  ABF  and  MSP  When 
V„  =  1.5  V.  =  1.5  V 
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TABLE  VIII 

Direction  Convergence  Comparison  of  ABF  and  MSP  When 
«  2.5  V,  V*  =  2.5  V 


k 

(DRT)t 

(DRT)i. 

1 

0.854619 

0.989042 

0.999390 

0.999444 

2 

0.995715 

0.998595 

0.999996 

0.999907 

3 

0.999661 

0.999826 

0.999999 

0.999997 

4 

0.999975 

0.999986 

1.000000 

1.000000 

5 

0.999998 

0  999999 

1.000000 

1000000 

6 

1.000000 

1.000000 

1.000000 

1.000000 

Method 

ABF 

MSP 

ABF 

MSP 

TABLE  IX 

Comparison  of  the  Effects  of  the  Approximate  Matrices  A,  and  K, 
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*  represents  that  pre-conditioning  matrix  C  which  im¬ 
proves  the  convergence  somewhat.  The  x  indicates  that 
C  has  actually  degraded  the  convergence. 

We  can  see  from  Table  IX  that  the  overall  effect  of  us¬ 
ing  X]  is  better  than  that  of  using  K2.  K]  contributes  four 
excellent  acceleration  transformations,  while  K2  gives 
only  two.  Kx  is  always  able  to  improve  the  block  G-S 
iteration,  but  K2  not.  Since  X,  is  closer  to  B  than  K2,  it  is 
no  surprise  that  the  above  fact  is  observed. 


V.  Conclusions 

We  have  presented  a  new  approach,  ABE,  to  construct 
the  pre-conditioning  transformations  for  block  equations 
and  shown  its  numerical  results  for  device  simulation.  Our 
primary  analysis  in  Section  EC  can  coincide  with  the  com¬ 
putational  results.  It  is  known  that  the  effect  of  the  pre¬ 


conditionings  is  good  if  the  features  such  as  strict  diago¬ 
nal  dominance  of  the  main  diagonal  blocks  and  simplicity 
of  the  off-diagonal  blocks  can  be  exploited.  Therefore,  we 
should  choose  one  or  two  of  the  best  transformations  based 
on  analysis  or  calculating.  Further  work  involving  the 
construction  of  the  needed  matrices  and  a  complete  the¬ 
oretical  analysis  is  still  under  development. 


Appendix 


cL  = 


R  L  = 


cl  = 


Rij,  — 


Cl  = 


Rl  = 


cl  = 


RL  = 


cl  — 


Rph  — 


Cl  = 


Rl  = 


r 2  _ 

^  La  ~ 


Rl  = 


r  2  _ 

Lh  ~ 


1 

-  A^W^W  f 


Aw  — 

Aqw^'W  Aw  A^^n  Aq^D^D^Q' 


1  -D+J*,, 

.0  I 

Aw  ~  A*,*. 

-A*. 4, 


~A*aDj, 


~A+  ^  A*, 


~jDjJ  D, 


A^V  ^4#r,^QnQnAQm4,  A^D^ 


A w  ~Aw&w 

A$^  A$n$n  A^^Dw^w* 


Aw  ~  £>**„£>*,*„^*„*  Dw„ 


—Dja.D, 


~D*,4,Du 


A*„4  ~  D$u Dw Aw  A+, ^  —  D^^DwDw, 


R 2 


Lb 


cl 


/  O' 

—Da.JDxl  l. 


R2 


Lc  ~ 


AW  D++, 

A*.*D*.+D^1  A++  /t*,*,  -  D^D$D„. 


cl 


-Dil  a 


-DZlDi 


W"**. 

I 


R2 


a i 


C2 


Rb  ~ 


R2 


Rb  ~ 


Cl 


R2Rc 


*.*."*<* 

AW  ~  D^'Dili'Di't  -A^DllD^, 

-A*.t  ~  A*.*.D^*.D*.i  "  A*.+D++D**. 

1  —D+l  D**„ 

0  / 

Aw>  ~AwD+l  D+tn 
L-^w  A*<*.  ~  D^„ 

I  O' 

-DiU,D^  /. 

Aw  ~  D^,Dil„D^  £>*♦. 

L^*.*  ~  A*A.D*l*.D+m  + 


r 

% 

i 


where  A ^  D^,  aga*  —  a+a*  D+^„,  matrices 

Cl  and  others  are  defined  in  Section  IV,  Rl  =  C\JB, 
and  similarly  for  the  others. 
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